Re: How to cherry pick a specific line from a flowfile?

Matt Burgess Thu, 09 Feb 2023 16:18:48 -0800

I’m AFK at the moment, but Range Sampling was added to the SampleRecord
processor (https://issues.apache.org/jira/browse/NIFI-9814). The Jira doesn’t
say which version it went into, but it is definitely in 1.19.1+. If that’s
available to you, you can just specify “2” as the range and it will return only
that line.


For the total record count without loading the whole thing into memory, there’s
probably a more efficient way, but you could use ConvertRecord to convert from
CSV to CSV, and it should write out the “record.count” attribute. I think
some/most/all record processors write this attribute, and they work record by
record, so they don’t load the whole thing into memory. Even SampleRecord adds a
record.count attribute, but if you specify one line the value will be 1 :)
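
If the record processors are not an option and you would rather stay in a
script, a rough sketch of the streaming idea in Groovy is below. It reads the
stream one line at a time, keeps only the wanted line and a running count, and
never holds the whole content in a variable. The helper name scanLines and the
sample data are made up for illustration, and the sketch assumes UTF-8 content
where one physical line is one record (a quoted CSV field containing a newline
would throw the count off).

```groovy
import java.nio.charset.StandardCharsets

// Hypothetical helper (not part of NiFi): streams the input line by line,
// remembering only the requested line and the total number of lines seen.
def scanLines(InputStream inputStream, int wantedLine) {
    def picked = null   // stays null if the input has fewer lines than wantedLine
    def total = 0
    def reader = new BufferedReader(
            new InputStreamReader(inputStream, StandardCharsets.UTF_8))
    String line
    while ((line = reader.readLine()) != null) {
        total++
        if (total == wantedLine) {
            picked = line
        }
    }
    return [line: picked, total: total]
}

// Stand-alone demo using an in-memory stream in place of flowfile content
def sample = 'table: demo\nid,name\n1,a\n2,b\n'
def result = scanLines(
        new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)), 2)
assert result.line == 'id,name'
assert result.total == 4
```

In ExecuteScript you would call something like this on the inputStream inside
the session.read callback, then set the attributes after the callback returns,
for example the header from result.line and the data count from
(result.total - 2) converted to a String, since attribute values are strings.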

Regards,
Matt


> On Feb 9, 2023, at 6:57 PM, James McMahon <jsmcmah...@gmail.com> wrote:
> 
> 
> Hello. I am trying to identify a header line and a data line count from a 
> flowfile that is in csv format.
> 
> Most of us are familiar with Matt B's outstanding Cookbook series, and I am 
> trying to use that as my starting point. Here is my Groovy code:
> 
> import org.apache.commons.io.IOUtils
> import java.nio.charset.StandardCharsets
> def ff=session.get()
> if(!ff)return
> try {
>      def text = ''
>      // Cast a closure with an inputStream parameter to InputStreamCallback
>      session.read(ff, {inputStream ->
>           text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
>           // Do something with text here
>           // get header from the second line of the flowfile
>           // set datacount as the total line count of the file - 2 
>           ...
>           ff = session.putAttribute(ff, 'mdb.table.header', header)
>           ff = session.putAttribute(ff, 'mdb.table.datarecords', datacount)
>      } as InputStreamCallback)
>      session.transfer(ff, REL_SUCCESS)
> } catch(e) {
>      log.error('Error occurred identifying tables in mdb file', e)
>      session.transfer(ff, REL_FAILURE)
> }
> 
> I want to avoid using that line in red (the IOUtils.toString call), because 
> our csv files are too large for it, as Matt cautions in his cookbook. I do 
> not want to read the entire file into the variable text. It's going to be a 
> problem.
> 
> How in Groovy can I cherry pick only the line I want from the stream (line #2 
> in this case)?
> 
> Also, how can I get a count of the total lines without loading them all into 
> text?
> 
> Thanks in advance for your help.
