[ https://issues.apache.org/jira/browse/NIFI-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051837#comment-17051837 ]
Chris Sampson commented on NIFI-7145:
-------------------------------------

Re-trying this same flow in NiFi 1.11.3 appears to work, so this may have been related to NIFI-7114 (at a guess). There may be nothing further to do on this ticket, or it may still be worth diagnosing further to confirm that nothing is still hanging around that would be worth addressing.

> Chained SplitText processors unable to handle files in some circumstances
> -------------------------------------------------------------------------
>
>                 Key: NIFI-7145
>                 URL: https://issues.apache.org/jira/browse/NIFI-7145
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.11.1
>         Environment: Docker Image (apache/nifi) running in Kubernetes (1.15)
>            Reporter: Chris Sampson
>            Priority: Minor
>         Attachments: Broken_SplitText.json, Broken_SplitText.xml, Screen Shot
> 2020-02-13 at 17.28.58.png, nifi-app.log, test.csv.tgz
>
>
> With chained SplitText processors (NiFi 1.11.1 apache/nifi Docker image with
> default nifi.properties, although configured to allow secure access in my
> environment with encrypted flowfile/provenance/content repositories; I don't
> know whether that makes a difference):
> * ingest a 40MB CSV file with 50k lines of data (plus 1 header)
> * SplitText - chunk the file into 10k-line segments (including the header in
> each file)
> * SplitText - break each row out into its own FlowFile
>
> The 10k chunking works fine, but then the files sit in the queue between the
> processors forever, with the second SplitText showing that it’s working but
> never actually producing anything (I can’t see anything in the logs, although
> I haven’t turned on debug logging to see whether that would provide anything
> more).
>
> If I reduce the chunk size to 1k then the per-row split works fine - maybe
> there is some sort of issue with SplitText and/or swapping of
> FlowFiles/content to the repositories? Similarly, trying the same with a
> smaller file (i.e. just include the first 3 columns from the attached file,
> but keep the 50k rows) seems to work fine too.
>
> Example Flow/Template attached with a file that breaks the flow (untar and
> copy into /tmp). The second SplitText is set to Concurrency=3 in the
> template, but fails just the same when set to the default Concurrency=1.
>
> SplitRecord would be an alternative (which works fine when I try it), but I
> can’t use that as we potentially lose data if the CSV is malformed (there are
> more data fields in a row than defined headers - the extra fields are thrown
> away by the Record processors, which I understand to be normal and that’s
> fine, but unfortunately I later need to ValidateRecord for each of these rows
> to check for this kind of invalidity).
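
For readers unfamiliar with the flow, the two chained splits described in the report can be modelled roughly as below. This is only a plain-Python sketch of the intended splitting semantics, not NiFi code: the function name `split_text` and its parameters are hypothetical stand-ins, and carrying the header into the per-row split is an assumption (the report does not say how the second processor is configured).

```python
from typing import Iterator, List


def split_text(lines: List[str], line_split_count: int,
               header_line_count: int = 1) -> Iterator[List[str]]:
    """Yield fragments of at most line_split_count data lines.

    Each fragment is prefixed with the first header_line_count lines,
    mimicking "include the header in each file" from the report.
    Hypothetical helper for illustration only.
    """
    header = lines[:header_line_count]
    body = lines[header_line_count:]
    for start in range(0, len(body), line_split_count):
        yield header + body[start:start + line_split_count]


# Stand-in for the 50k-row CSV (1 header line plus 50,000 data lines).
source = ["header"] + [f"row{i}" for i in range(50_000)]

# First split: 10k-line chunks, header repeated in each chunk.
chunks = list(split_text(source, line_split_count=10_000))

# Second split: one data row per fragment (header kept here by assumption).
rows = [frag for chunk in chunks for frag in split_text(chunk, line_split_count=1)]
```

Under these assumptions the first stage produces five fragments and the second stage produces one fragment per data row; the reported problem is that in 1.11.1 the second stage never emitted anything for the 10k-line chunks.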