[
https://issues.apache.org/jira/browse/NIFI-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pierre Villard resolved NIFI-7145.
----------------------------------
Resolution: Feedback Received
Apache NiFi 1.x is no longer maintained and no new release is planned on the
1.x release line. Marking as resolved as part of a cleanup operation. Please
open a new one with an updated description if this is still relevant for NiFi
2.x.
> Chained SplitText processors unable to handle files in some circumstances
> -------------------------------------------------------------------------
>
> Key: NIFI-7145
> URL: https://issues.apache.org/jira/browse/NIFI-7145
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.11.1
> Environment: Docker Image (apache/nifi) running in Kubernetes (1.15)
> Reporter: Chris Sampson
> Priority: Minor
> Attachments: Broken_SplitText.json, Broken_SplitText.xml, Screen Shot
> 2020-02-13 at 17.28.58.png, nifi-app.log, test.csv.tgz
>
>
> With chained SplitText processors (NiFi 1.11.1 apache/nifi Docker image with
> default nifi.properties, although configured to allow secure access in my
> environment with encrypted flowfile/provenance/content repositories; I don't
> know whether that makes a difference):
> * ingest 40MB CSV file with 50k lines of data (plus 1 header)
> * SplitText - chunk the file into 10k-line segments (including header in
> each file)
> * SplitText - break each row out into its own FlowFile
>
> The 10k chunking works fine, but then the files sit in the queue between the
> processors forever, with the second SplitText showing that it’s working but
> never actually producing anything (I can’t see anything in the logs, although
> I haven’t turned on debug logging to see whether that would provide anything
> more).
>
> If I reduce the chunk size to 1k then the per-row split works fine - maybe
> some sort of issue with SplitText and/or swapping of FlowFiles/content to the
> repositories? Similarly, trying the same with a smaller file (i.e. just
> include the first 3 columns from the attached file, but keep the 50k rows)
> seems to work fine too.
>
> Example Flow/Template attached with file that breaks the flow (untar and
> copy into /tmp). Second SplitText set to Concurrency=3 in the template, but
> fails just the same when set to default Concurrency=1.
>
> SplitRecord would be an alternative (which works fine when I try it), but I
> can’t use that as we potentially lose data if the CSV is malformed (there are
> more data fields in a row than defined headers - the extra fields are thrown
> away by the Record processors, which I understand to be normal and that’s
> fine, but unfortunately I later need to ValidateRecord for each of these rows
> to check for this kind of invalidity).
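
The malformed-row condition described in the last paragraph of the report (a row carrying more fields than the header defines) can be illustrated outside NiFi. The following is a minimal Python sketch, assuming comma-delimited input with a single header row; the function name `rows_with_extra_fields` is hypothetical and is not part of NiFi or of the reporter's flow.

```python
import csv
import io


def rows_with_extra_fields(csv_text):
    """Return (row_number, field_count) for rows wider than the header.

    Hypothetical helper for illustration only; not a NiFi API.
    Assumes comma-delimited text whose first row is the header.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        # Empty input: there is no header and nothing to check.
        return []
    expected = len(header)
    bad = []
    # The header is row 1, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        if len(row) > expected:
            bad.append((row_number, len(row)))
    return bad


sample = "a,b,c\n1,2,3\n4,5,6,7\n"
print(rows_with_extra_fields(sample))
```

With the sample above, only the third row is reported, since it has four fields against a three-column header. Rows are counted as parsed CSV records, so a quoted field containing a newline still counts as one row.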
--
This message was sent by Atlassian Jira
(v8.20.10#820010)