[ 
https://issues.apache.org/jira/browse/NIFI-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036718#comment-17036718
 ] 

Chris Sampson commented on NIFI-7145:
-------------------------------------

"Too many open files" is exactly the issue I was trying to work around by 
having the 2-phase split. Initially we just took the whole file and split to 
single lines, but this was no longer possible when the files being ingested 
were over ~65k rows. The Docker daemon on which our NiFi pod runs currently has 
a nofile ulimit of 65536. We are in the process of having this increased, but 
the multi-phase split is still a good idea, of course, because it reduces the 
number of files open at any one time.
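
For anyone following along, here is a rough sketch of the 2-phase idea in plain Python. It is not how SplitText is implemented and it does not touch NiFi at all; the function names are made up for illustration, and pairing the header with each row in the second phase is an assumption (the flow description only says the header is kept in the first phase).

```python
from typing import Iterable, Iterator, List


def chunk_with_header(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Phase 1: yield chunks of up to chunk_size data rows, each led by the header."""
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return  # empty input: nothing to split
    chunk: List[str] = []
    for row in it:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield [header] + chunk
            chunk = []
    if chunk:
        yield [header] + chunk  # final, possibly shorter, chunk


def split_rows(chunk: List[str]) -> Iterator[List[str]]:
    """Phase 2: yield one [header, row] piece per data row of a single chunk.

    Keeping the header with each row is an assumption made for this sketch.
    """
    header, rows = chunk[0], chunk[1:]
    for row in rows:
        yield [header, row]


if __name__ == "__main__":
    # 1 header line plus 50k data rows, mirroring the sizes in the issue.
    data = ["id,name"] + [f"{i},row{i}" for i in range(50_000)]
    chunks = list(chunk_with_header(data, 10_000))
    print(len(chunks))                                     # 5
    print(sum(1 for c in chunks for _ in split_rows(c)))   # 50000
```

The point is only that the second phase works on one 10k chunk at a time, so the number of pieces in flight at once is bounded by the chunk size rather than by the size of the whole file.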

To be clear, we don't see this error (nor any others) in our logs when running 
this flow.

> Chained SplitText processors unable to handle files in some circumstances
> -------------------------------------------------------------------------
>
>                 Key: NIFI-7145
>                 URL: https://issues.apache.org/jira/browse/NIFI-7145
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.11.1
>         Environment: Docker Image (apache/nifi) running in Kubernetes (1.15)
>            Reporter: Chris Sampson
>            Priority: Minor
>         Attachments: Broken_SplitText.json, Broken_SplitText.xml, Screen Shot 
> 2020-02-13 at 17.28.58.png, nifi-app.log, test.csv.tgz
>
>
> With chained SplitText processors (NiFi 1.11.1 apache/nifi Docker image with 
> default nifi.properties, although configured to allow secure access in my 
> environment with encrypted flowfile/provenance/content repositories, don't 
> know whether that makes a difference):
>  * ingest 40MB CSV file with 50k lines of data (plus 1 header)
>  * SplitText - chunk the file into 10k segments (including header in each 
> file)
>  * SplitText - break each row out into its own FlowFile
>  
>  The 10k chunking works fine, but then the files sit in the queue between the 
> processors forever, with the second SplitText showing that it’s working but 
> never actually producing anything (can’t see anything in the logs, although 
> I haven’t turned on debug logging to see whether that would provide anything 
> more).
>   
>  If I reduce the chunk size to 1k then the per-row split works fine - maybe 
> some sort of issue with SplitText and/or swapping of FlowFiles/content to the 
> repositories? Similarly, trying the same with a smaller file (i.e. just 
> include the first 3 columns from the attached file, but keep the 50k rows) 
> seems to work fine too.
>   
>  Example Flow/Template attached with file that breaks the flow (untar and 
> copy into /tmp). Second SplitText set to Concurrency=3 in the template, but 
> fails just the same when set to default Concurrency=1.
>   
>  SplitRecord would be an alternative (which works fine when I try it), but I 
> can’t use that as we potentially lose data if the CSV is malformed (there are 
> more data fields in a row than defined headers - the extra fields are thrown 
> away by the Record processors, which I understand to be normal and that’s 
> fine, but unfortunately I later need to ValidateRecord for each of these rows 
> to check for this kind of invalidity).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
