[jira] [Commented] (NIFI-4496) Improve performance of CSVReader
[ https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16301480#comment-16301480 ]

ASF subversion and git services commented on NIFI-4496:
--------------------------------------------------------

Commit 14d2291db87d8ea160f538c10de31ac69fc996ae in nifi's branch refs/heads/master from [~ca9mbu]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=14d2291 ]

NIFI-4496: Added JacksonCSVRecordReader to allow choice of CSV parser. This closes #2245.

> Improve performance of CSVReader
> --------------------------------
>
> Key: NIFI-4496
> URL: https://issues.apache.org/jira/browse/NIFI-4496
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Matt Burgess
> Assignee: Matt Burgess
> Fix For: 1.5.0
>
> During some throughput testing, it was noted that the CSVReader was not as
> fast as desired, processing less than 50k records per second. A look at [this
> benchmark|https://github.com/uniVocity/csv-parsers-comparison] implies that
> the Apache Commons CSV parser (used by CSVReader) is quite slow compared to
> others.
> From that benchmark it appears that CSVReader could be enhanced by using a
> different CSV parser under the hood. Perhaps Jackson is the best choice, as
> it is fast when values are quoted, and is a mature and maintained codebase.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
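The issue description quotes a throughput figure (under 50k records per second) without showing how it was measured. As a rough illustration only, the sketch below times how long it takes to drain any record iterator and converts that to records per second. The class and method names are hypothetical and are not part of NiFi or of the linked benchmark; a real measurement would also need warm-up runs and realistic input.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper for illustration; not part of NiFi. */
final class ThroughputSketch {

    /** Converts a record count and an elapsed time in nanoseconds to records per second. */
    static double recordsPerSecond(long records, long elapsedNanos) {
        return records * 1_000_000_000.0 / elapsedNanos;
    }

    /** Drains the iterator, timing the loop, and returns the observed rate. */
    static double measure(Iterator<?> records) {
        long start = System.nanoTime();
        long count = 0;
        while (records.hasNext()) {
            records.next();
            count++;
        }
        // Guard against a zero reading on very small inputs.
        long elapsed = Math.max(1L, System.nanoTime() - start);
        return recordsPerSecond(count, elapsed);
    }

    public static void main(String[] args) {
        // Stand-in for parsed records; a real test would iterate a CSV parser instead.
        List<String[]> rows = List.of(new String[] {"1", "a"}, new String[] {"2", "b"});
        System.out.println("records/second: " + measure(rows.iterator()));
    }
}
```

Because `measure` accepts any `Iterator`, the same loop could be pointed at either parser's record stream to compare them under identical conditions.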
[jira] [Commented] (NIFI-4496) Improve performance of CSVReader
[ https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16301481#comment-16301481 ]

ASF GitHub Bot commented on NIFI-4496:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/nifi/pull/2245
[jira] [Commented] (NIFI-4496) Improve performance of CSVReader
[ https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298635#comment-16298635 ]

ASF GitHub Bot commented on NIFI-4496:
--------------------------------------

Github user mattyb149 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2245#discussion_r158051846

--- Diff: nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/JacksonCSVRecordReader.java ---
@@ -0,0 +1,257 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nifi.csv;
+
+import com.fasterxml.jackson.databind.MappingIterator;
+import com.fasterxml.jackson.databind.ObjectReader;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvParser;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.io.input.BOMInputStream;
+import org.apache.commons.lang3.CharUtils;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.serialization.MalformedRecordException;
+import org.apache.nifi.serialization.RecordReader;
+import org.apache.nifi.serialization.record.DataType;
+import org.apache.nifi.serialization.record.MapRecord;
+import org.apache.nifi.serialization.record.Record;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.serialization.record.util.DataTypeUtils;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+import java.text.DateFormat;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.function.Supplier;
+
+
+public class JacksonCSVRecordReader implements RecordReader {
+    private final RecordSchema schema;
+
+    private final Supplier<DateFormat> LAZY_DATE_FORMAT;
+    private final Supplier<DateFormat> LAZY_TIME_FORMAT;
+    private final Supplier<DateFormat> LAZY_TIMESTAMP_FORMAT;
+
+    private final ComponentLog logger;
+    private final boolean hasHeader;
+    private final boolean ignoreHeader;
+    private final MappingIterator<String[]> recordStream;
+    private List<String> rawFieldNames = null;
+
+    private volatile static CsvMapper mapper = new CsvMapper().enable(CsvParser.Feature.WRAP_AS_ARRAY);
+
+    public JacksonCSVRecordReader(final InputStream in, final ComponentLog logger, final RecordSchema schema, final CSVFormat csvFormat, final boolean hasHeader, final boolean ignoreHeader,
+                                  final String dateFormat, final String timeFormat, final String timestampFormat, final String encoding) throws IOException {
+
+        this.schema = schema;
+        this.logger = logger;
+        this.hasHeader = hasHeader;
+        this.ignoreHeader = ignoreHeader;
+        final DateFormat df = dateFormat == null ? null : DataTypeUtils.getDateFormat(dateFormat);
+        final DateFormat tf = timeFormat == null ? null : DataTypeUtils.getDateFormat(timeFormat);
+        final DateFormat tsf = timestampFormat == null ? null : DataTypeUtils.getDateFormat(timestampFormat);
+
+        LAZY_DATE_FORMAT = () -> df;
+        LAZY_TIME_FORMAT = () -> tf;
+        LAZY_TIMESTAMP_FORMAT = () -> tsf;
+
+        final Reader reader = new InputStreamReader(new BOMInputStream(in));
+
+        CsvSchema.Builder csvSchemaBuilder = CsvSchema.builder()
+                .setColumnSeparator(csvFormat.getDelimiter())
+                .setLineSeparator(csvFormat.getRecordSeparator())
+                // Can only use comments in Jackson CSV if the correct marker is set
+                .setAllowComments("#" .equals(CharUti
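The constructor quoted above wraps the incoming stream in Commons IO's `BOMInputStream` before handing it to a `Reader`, so that a leading byte order mark does not end up in the first field of the first record. As a rough illustration of what that wrapping achieves, here is a JDK-only sketch that drops a leading UTF-8 BOM. It is an assumption-laden stand-in, not the Commons IO implementation: the class and method names are hypothetical, and only the UTF-8 mark (bytes EF BB BF) is handled.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical JDK-only stand-in for illustration; not the Commons IO class. */
final class BomSkipper {

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = pushback.readNBytes(head, 0, head.length);
        boolean isBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && read > 0) {
            // Not a BOM: put the bytes back so the caller sees the full content.
            pushback.unread(head, 0, read);
        }
        return pushback;
    }

    /** Convenience wrapper: decodes the bytes as UTF-8 after dropping any BOM. */
    static String decodeUtf8WithoutBom(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd', ',', 'n'};
        System.out.println(decodeUtf8WithoutBom(withBom));
    }
}
```

Without this step, a header-derived field name could silently carry the invisible mark and fail to match the schema's field name.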
[jira] [Commented] (NIFI-4496) Improve performance of CSVReader
[ https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298633#comment-16298633 ]

ASF GitHub Bot commented on NIFI-4496:
--------------------------------------

Github user markap14 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2245#discussion_r158051319

--- Diff: nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/JacksonCSVRecordReader.java ---
@@ -0,0 +1,257 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nifi.csv;
+
+import com.fasterxml.jackson.databind.MappingIterator;
+import com.fasterxml.jackson.databind.ObjectReader;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvParser;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.io.input.BOMInputStream;
+import org.apache.commons.lang3.CharUtils;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.serialization.MalformedRecordException;
+import org.apache.nifi.serialization.RecordReader;
+import org.apache.nifi.serialization.record.DataType;
+import org.apache.nifi.serialization.record.MapRecord;
+import org.apache.nifi.serialization.record.Record;
+import org.apache.nifi.serialization.record.RecordSchema;
+import org.apache.nifi.serialization.record.util.DataTypeUtils;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+import java.text.DateFormat;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.function.Supplier;
+
+
+public class JacksonCSVRecordReader implements RecordReader {
+    private final RecordSchema schema;
+
+    private final Supplier<DateFormat> LAZY_DATE_FORMAT;
+    private final Supplier<DateFormat> LAZY_TIME_FORMAT;
+    private final Supplier<DateFormat> LAZY_TIMESTAMP_FORMAT;
+
+    private final ComponentLog logger;
+    private final boolean hasHeader;
+    private final boolean ignoreHeader;
+    private final MappingIterator<String[]> recordStream;
+    private List<String> rawFieldNames = null;
+
+    private volatile static CsvMapper mapper = new CsvMapper().enable(CsvParser.Feature.WRAP_AS_ARRAY);
+
+    public JacksonCSVRecordReader(final InputStream in, final ComponentLog logger, final RecordSchema schema, final CSVFormat csvFormat, final boolean hasHeader, final boolean ignoreHeader,
+                                  final String dateFormat, final String timeFormat, final String timestampFormat, final String encoding) throws IOException {
+
+        this.schema = schema;
+        this.logger = logger;
+        this.hasHeader = hasHeader;
+        this.ignoreHeader = ignoreHeader;
+        final DateFormat df = dateFormat == null ? null : DataTypeUtils.getDateFormat(dateFormat);
+        final DateFormat tf = timeFormat == null ? null : DataTypeUtils.getDateFormat(timeFormat);
+        final DateFormat tsf = timestampFormat == null ? null : DataTypeUtils.getDateFormat(timestampFormat);
+
+        LAZY_DATE_FORMAT = () -> df;
+        LAZY_TIME_FORMAT = () -> tf;
+        LAZY_TIMESTAMP_FORMAT = () -> tsf;
+
+        final Reader reader = new InputStreamReader(new BOMInputStream(in));
+
+        CsvSchema.Builder csvSchemaBuilder = CsvSchema.builder()
+                .setColumnSeparator(csvFormat.getDelimiter())
+                .setLineSeparator(csvFormat.getRecordSeparator())
+                // Can only use comments in Jackson CSV if the correct marker is set
+                .setAllowComments("#" .equals(CharUtil
[jira] [Commented] (NIFI-4496) Improve performance of CSVReader
[ https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292927#comment-16292927 ]

ASF GitHub Bot commented on NIFI-4496:
--------------------------------------

Github user mattyb149 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2245#discussion_r157261623

--- Diff: nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/JacksonCSVRecordReader.java ---
@@ -136,7 +134,7 @@ public Record nextRecord(final boolean coerceTypes, final boolean dropUnknownFie
             // If the first record is the header names (and we're using them), store those off for use in creating the value map on the next iterations
             if (rawFieldNames == null) {
-                if (hasHeader && ignoreHeader) {
+                if (!hasHeader || ignoreHeader) {
                     rawFieldNames = schema.getFieldNames();
                 } else {
                     rawFieldNames = Arrays.stream(csvRecord).map((a) -> {
--- End diff --

Who knows lol. I'll try asList() instead
[jira] [Commented] (NIFI-4496) Improve performance of CSVReader
[ https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292922#comment-16292922 ]

ASF GitHub Bot commented on NIFI-4496:
--------------------------------------

Github user markap14 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2245#discussion_r157261143

--- Diff: nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/JacksonCSVRecordReader.java ---
@@ -136,7 +134,7 @@ public Record nextRecord(final boolean coerceTypes, final boolean dropUnknownFie
             // If the first record is the header names (and we're using them), store those off for use in creating the value map on the next iterations
             if (rawFieldNames == null) {
-                if (hasHeader && ignoreHeader) {
+                if (!hasHeader || ignoreHeader) {
                     rawFieldNames = schema.getFieldNames();
                 } else {
                     rawFieldNames = Arrays.stream(csvRecord).map((a) -> {
--- End diff --

I'm not sure that I understand the logic here... was this perhaps due to some refactoring and got overlooked, or is this actually doing something that's just not obvious to me? Seems this could just be done as `Arrays.asList(csvRecord)`
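The review suggestion above is to replace the stream pipeline with `Arrays.asList(csvRecord)`. The lambda in the quoted hunk is cut off, so what it actually mapped is unknown; the sketch below assumes, purely for illustration, that it returned each element unchanged. Under that assumption both forms yield lists with equal contents, the difference being that `Arrays.asList` returns a fixed-size view backed by the array rather than a copy. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical comparison for illustration; not code from the pull request. */
final class HeaderListDemo {

    /** Copies the array into a new list via a stream (identity mapping assumed). */
    static List<String> viaStream(String[] csvRecord) {
        return Arrays.stream(csvRecord).map(a -> a).collect(Collectors.toList());
    }

    /** Wraps the array in a fixed-size list view without copying. */
    static List<String> viaAsList(String[] csvRecord) {
        return Arrays.asList(csvRecord);
    }

    public static void main(String[] args) {
        String[] csvRecord = {"id", "name", "email"};
        System.out.println(viaStream(csvRecord).equals(viaAsList(csvRecord)));
    }
}
```

One consequence of the view semantics is that later writes to the array show through the list, which matters only if the parser reuses the same array for subsequent rows.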
[jira] [Commented] (NIFI-4496) Improve performance of CSVReader
[ https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292819#comment-16292819 ]

ASF GitHub Bot commented on NIFI-4496:
--------------------------------------

Github user mattyb149 commented on the issue:

    https://github.com/apache/nifi/pull/2245

@jdye64 I think I fixed the issue you were seeing. We have to do most of the schema resolution/management manually, because Jackson's methods for handling that don't seem to work for what we need. So I removed the setting of column names on the parser; having the column names set made the parser expect an actual array, with [] surrounding the line (weird, right?). Then, for files without headers, I needed to make sure we used the schema field names, so I had to adjust the logic where "rawFieldNames" is generated. Mind taking a look at this latest version? Please and thanks!
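The adjustment to the "rawFieldNames" logic described above can be read as a small decision rule: use the schema's field names when the file has no header line or when its header is to be ignored, and otherwise take the names from the first row. The sketch below restates that rule in isolation. The class and method are hypothetical and the parameters stand in for the reader's fields; it is not the code from the pull request.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; mirrors the condition discussed above. */
final class FieldNameResolver {

    /**
     * Picks the names used to key each record's values.
     * Schema names win when there is no header line, or when a header exists
     * but should be ignored; otherwise the first row supplies the names.
     */
    static List<String> resolve(boolean hasHeader, boolean ignoreHeader,
                                List<String> schemaFieldNames, String[] firstRow) {
        if (!hasHeader || ignoreHeader) {
            return schemaFieldNames;
        }
        return Arrays.asList(firstRow);
    }

    public static void main(String[] args) {
        List<String> schema = List.of("id", "name");
        String[] firstRow = {"ID", "NAME"};
        System.out.println(resolve(false, false, schema, firstRow));
        System.out.println(resolve(true, true, schema, firstRow));
        System.out.println(resolve(true, false, schema, firstRow));
    }
}
```

With a condition of `hasHeader && ignoreHeader` instead, a file without a header line would fall through to the second branch and treat its first data row as field names, which is consistent with the headerless case called out in the comment.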
[jira] [Commented] (NIFI-4496) Improve performance of CSVReader
[ https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242089#comment-16242089 ]

ASF GitHub Bot commented on NIFI-4496:
--------------------------------------

Github user jdye64 commented on the issue:

    https://github.com/apache/nifi/pull/2245

@mattyb149 I'm seeing invalid output when I run an existing flow with this PR. I had an existing flow that used ConvertRecord and Apache Commons CSV. That was working fine and giving me the output I expected. However, when I switched to the Jackson implementation, all of the output was empty. I have attached a screenshot from my debugger session in the hope that it will help shed some light on what is going on.

https://user-images.githubusercontent.com/2127235/32498256-32f8ffc6-c39d-11e7-86dd-cde8f7d3a758.png
[jira] [Commented] (NIFI-4496) Improve performance of CSVReader
[ https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234547#comment-16234547 ]

ASF GitHub Bot commented on NIFI-4496:
--------------------------------------

Github user andrewmlim commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2245#discussion_r148347696

--- Diff: nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/CSVReader.java ---
@@ -54,6 +54,26 @@
             "The first non-comment line of the CSV file is a header line that contains the names of the columns. The schema will be derived by using the "
             + "column names in the header and assuming that all columns are of type String.");

+    // CSV parsers
+    public static final AllowableValue APACHE_COMMONS_CSV = new AllowableValue("commons-csv", "Apache Commons CSV",
+            "The CSV parser implementation from the Apache Commons CSV library.");
+
+    public static final AllowableValue JACKSON_CSV = new AllowableValue("jackson-csv", "Jackson CSV",
+            "The CSV parser implementation from the Jackson Dataformats library");
+
+
+    public static final PropertyDescriptor CSV_PARSER = new PropertyDescriptor.Builder()
+            .name("csv-reader-csv-parser")
+            .displayName("CSV Parser")
+            .description("Specifies which parser to use to read CSV records. NOTE: Different parsers may support different subsets of functionality, "
+                    + "and/or exhibit different levels of performance.")
--- End diff --

Suggest changing the NOTE to:

Different parsers may support different subsets of functionality and may also exhibit different levels of performance.

> Improve performance of CSVReader
> --------------------------------
>
> Key: NIFI-4496
> URL: https://issues.apache.org/jira/browse/NIFI-4496
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Matt Burgess
> Assignee: Matt Burgess
> Priority: Major
>
> During some throughput testing, it was noted that the CSVReader was not as
> fast as desired, processing less than 50k records per second. A look at [this
> benchmark|https://github.com/uniVocity/csv-parsers-comparison] implies that
> the Apache Commons CSV parser (used by CSVReader) is quite slow compared to
> others.
> From that benchmark it appears that CSVReader could be enhanced by using a
> different CSV parser under the hood. Perhaps Jackson is the best choice, as
> it is fast when values are quoted, and is a mature and maintained codebase.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (NIFI-4496) Improve performance of CSVReader
[ https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234544#comment-16234544 ]

ASF GitHub Bot commented on NIFI-4496:
--------------------------------------

Github user andrewmlim commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2245#discussion_r148347427

--- Diff: nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/CSVReader.java ---
@@ -54,6 +54,26 @@
             "The first non-comment line of the CSV file is a header line that contains the names of the columns. The schema will be derived by using the "
             + "column names in the header and assuming that all columns are of type String.");

+    // CSV parsers
+    public static final AllowableValue APACHE_COMMONS_CSV = new AllowableValue("commons-csv", "Apache Commons CSV",
+            "The CSV parser implementation from the Apache Commons CSV library.");
+
+    public static final AllowableValue JACKSON_CSV = new AllowableValue("jackson-csv", "Jackson CSV",
+            "The CSV parser implementation from the Jackson Dataformats library");
--- End diff --

Need a period (.) after library to be consistent.
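The diff quoted above introduces two allowable values, "commons-csv" and "jackson-csv", for choosing the parser. How the reader dispatches on the selected value is not shown in this thread, so the following is only a sketch of one way such a value could be mapped to an implementation choice. The enum, the class name and the error handling are all assumptions for illustration; only the two value strings come from the diff.

```java
import java.util.Map;

/** Hypothetical dispatch sketch for illustration; not the NiFi implementation. */
final class CsvParserChoice {

    /** The two parser options named in the quoted diff. */
    enum Parser { APACHE_COMMONS_CSV, JACKSON_CSV }

    // Keys are the allowable-value strings from the diff.
    private static final Map<String, Parser> BY_VALUE = Map.of(
            "commons-csv", Parser.APACHE_COMMONS_CSV,
            "jackson-csv", Parser.JACKSON_CSV);

    /** Resolves a configured property value to a parser, rejecting anything else. */
    static Parser fromPropertyValue(String value) {
        Parser parser = value == null ? null : BY_VALUE.get(value);
        if (parser == null) {
            throw new IllegalArgumentException("Unknown CSV parser: " + value);
        }
        return parser;
    }

    public static void main(String[] args) {
        System.out.println(fromPropertyValue("jackson-csv"));
    }
}
```

Failing fast on an unrecognised value, rather than silently defaulting, is a design choice made here for the sketch and may differ from what the component actually does.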
[jira] [Commented] (NIFI-4496) Improve performance of CSVReader
[ https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234252#comment-16234252 ]

ASF GitHub Bot commented on NIFI-4496:
--------------------------------------

GitHub user mattyb149 opened a pull request:

    https://github.com/apache/nifi/pull/2245

    NIFI-4496: Added JacksonCSVRecordReader to allow choice of CSV parser

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
- [x] Is your initial contribution a single, squashed commit?

### For code changes:
- [x] Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
- [x] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
- [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
- [x] If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

### For documentation related changes:
- [x] Have you ensured that format looks appropriate for the output in which it is rendered?

### Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mattyb149/nifi NIFI-4496

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/2245.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2245

commit 15040f4f67a785ab16894992ffeca7d7847f62f1
Author: Matthew Burgess
Date: 2017-11-01T15:50:06Z

    NIFI-4496: Added JacksonCSVRecordReader to allow choice of CSV parser