[ 
https://issues.apache.org/jira/browse/NIFI-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051837#comment-17051837
 ] 

Chris Sampson commented on NIFI-7145:
-------------------------------------

Re-trying this same flow in NiFi 1.11.3 appears to work, so maybe this was 
related to NIFI-7114 (at a guess)?

It could be that there's nothing further to do on this ticket, or it may still 
be worth diagnosing further to confirm there isn't something lingering that 
would be worth addressing.
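For anyone wanting to retest without the attachment, something like the 
following generates a CSV of roughly the shape described below (50k data rows 
plus a header, ~40MB). Note the column count and cell width here are guesses 
chosen to reach that size, not the contents of the attached test.csv.tgz:

```python
import csv

# Sketch: build a CSV matching the shape described in the issue
# (50k data rows plus 1 header line). COLS/CELL are assumptions
# sized to produce a file of roughly 40MB.
ROWS = 50_000
COLS = 20
CELL = "x" * 38  # ~780 bytes per row * 50k rows ~= 39MB

with open("/tmp/test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([f"col_{i}" for i in range(COLS)])  # header line
    writer.writerows([CELL] * COLS for _ in range(ROWS))
```

Feeding that file through the same two SplitText processors should exercise 
the same line counts and file size as the flow described in the issue.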

> Chained SplitText processors unable to handle files in some circumstances
> -------------------------------------------------------------------------
>
>                 Key: NIFI-7145
>                 URL: https://issues.apache.org/jira/browse/NIFI-7145
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.11.1
>         Environment: Docker Image (apache/nifi) running in Kubernetes (1.15)
>            Reporter: Chris Sampson
>            Priority: Minor
>         Attachments: Broken_SplitText.json, Broken_SplitText.xml, Screen Shot 
> 2020-02-13 at 17.28.58.png, nifi-app.log, test.csv.tgz
>
>
> With chained SplitText processors (NiFi 1.11.1 apache/nifi Docker image with 
> default nifi.properties, although configured to allow secure access in my 
> environment with encrypted flowfile/provenance/content repositories; I don't 
> know whether that makes a difference):
>  * ingest a 40MB CSV file with 50k lines of data (plus 1 header line)
>  * SplitText - chunk the file into 10k-line segments (including the header 
> in each file)
>  * SplitText - break each row out into its own FlowFile
>  
>  The 10k chunking works fine, but the files then sit in the queue between 
> the processors forever: the second SplitText shows it's working but never 
> actually produces anything (I can't see anything in the logs, although I 
> haven't turned on debug logging to check whether that would provide anything 
> more).
>   
>  If I reduce the chunk size to 1k then the per-row split works fine - maybe 
> some sort of issue with SplitText and/or the swapping of FlowFiles/content 
> to the repositories? Similarly, trying the same with a smaller file (i.e. 
> just the first 3 columns from the attached file, but keeping the 50k rows) 
> seems to work fine too.
>   
>  An example Flow/Template is attached, along with the file that breaks the 
> flow (untar and copy into /tmp). The second SplitText is set to 
> Concurrency=3 in the template, but it fails just the same when set to the 
> default Concurrency=1.
>   
>  SplitRecord would be an alternative (and works fine when I try it), but I 
> can't use it because we could lose data if the CSV is malformed (when a row 
> has more data fields than defined headers, the extra fields are thrown away 
> by the Record processors - which I understand is normal behaviour and is 
> fine, except that I later need to run ValidateRecord on each of these rows 
> to check for exactly this kind of invalidity).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
