Chase,

I want to understand the use case better before I try to offer any advice.

So you want to write the FlowFiles to a directory, and then run an external 
script to process those files, correct?
Then, once the script has run, what does it do with the result? Does it write 
it to a file, write to standard out, 
interact directly with the database, etc?

Thanks
-Mark

----------------------------------------
> Date: Mon, 22 Jun 2015 15:06:47 -0500
> From: [email protected]
> To: [email protected]
> Subject: Re: Extracting text using RegEx
>
> so i have nifi pulling in data in .txt format from about 30 different
> sites....that data gets dumped to a directory called feedfiles...then i
> have a script that will parse out the ip's, exe's, domains, etc..so that
> the parsed stuff can be allocated to a database for indexing...
>
> having trouble automating this activity from the nifi standpoint...help
> is appreciated.
>
> On 6/22/15 2:55 PM, Mark Payne wrote:
>> Chase,
>>
>> You could certainly use the ExecuteStreamCommand processor to accomplish 
>> that.
>>
>> You can see the usage guide/documentation for that processor at [1]. Give 
>> that a look and
>> let me know if it meets your needs or not.
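As a rough illustration of the kind of script ExecuteStreamCommand could call: the processor passes the FlowFile content to the command on standard input, and whatever the command writes to standard output becomes the content of the outgoing FlowFile. The sketch below is hypothetical. The regular expressions, the function names (parse_indicators, format_lines, run), and the tab-separated output format are all assumptions made for illustration, not anything taken from an existing script.

```python
import re

# Hypothetical patterns for illustration; real feeds will likely need stricter ones.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EXE = re.compile(r"\b[\w.-]+\.exe\b", re.IGNORECASE)


def parse_indicators(text):
    """Return sorted, de-duplicated IPv4-looking strings and .exe names."""
    return {
        "ip": sorted(set(IPV4.findall(text))),
        "exe": sorted(set(EXE.findall(text))),
    }


def format_lines(indicators):
    """Render one 'kind<TAB>value' line per indicator."""
    return [f"{kind}\t{value}"
            for kind, values in indicators.items()
            for value in values]


def run(stdin, stdout):
    """Read all input, then write the parsed indicators, one per line."""
    for line in format_lines(parse_indicators(stdin.read())):
        stdout.write(line + "\n")
```

To use this as a command, the script would end with a call such as run(sys.stdin, sys.stdout) after importing sys. Keeping the parsing in a pure function makes it easy to test outside NiFi.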
>>
>> Thanks
>> -Mark
>>
>> [1] 
>> http://nifi.incubator.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExecuteStreamCommand/index.html
>>
>>
>> ----------------------------------------
>>> Date: Mon, 22 Jun 2015 14:21:00 -0500
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: Re: Extracting text using RegEx
>>>
>>> how can one run a script within NIFI to accomplish parsing?
>>>
>>> On 6/22/15 12:41 PM, Mark Payne wrote:
>>>> Srujan,
>>>>
>>>> My guess is that the issue you are seeing is due to the GetHTTP caching 
>>>> the ETag/LastModified value. When the
>>>> processor receives the response for an HTTP GET request, it writes the 
>>>> ETag to conf/.httpCache-<processor id>.
>>>>
>>>> It does this so that even after a restart of nifi, we don't keep pulling 
>>>> the same content. If the content changes at any
>>>> point, it will pull the new version of the content, though.
>>>>
>>>> You could trigger it to pull data either by copying and pasting the 
>>>> GetHTTP Processor and letting the new processor
>>>> pull the data, or you could delete that file from the conf/ directory and 
>>>> restart.
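For anyone unfamiliar with ETag/Last-Modified caching, here is a minimal sketch of the generic HTTP conditional-request pattern. It is not NiFi's code, and whether GetHTTP works exactly this way internally is an assumption. The function names and the dictionary layout of the cache are hypothetical.

```python
def conditional_headers(cached):
    """Build request headers from a cached ETag / Last-Modified pair (both optional)."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def handle_response(status, response_headers, cached):
    """Return (emit_content, updated_cache) for a response to a conditional GET."""
    if status == 304:
        # Not Modified: the server confirms the cached version is still current.
        return False, cached
    updated = {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
    }
    return True, updated
```

With an empty cache, conditional_headers returns no headers, so the server sends the full content again. In this model, that is why removing the cached values forces a fresh pull.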
>>>>
>>>> If this doesn't give you what you need, please feel free to let me know!
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>> ----------------------------------------
>>>>> From: [email protected]
>>>>> To: [email protected]
>>>>> Subject: RE: Extracting text using RegEx
>>>>> Date: Mon, 22 Jun 2015 15:11:18 +0000
>>>>>
>>>>> Mark,
>>>>>
>>>>> How can I rerun the processors after changing some of the attributes? For 
>>>>> example, when I change the Regex pattern and start the processors, 
>>>>> nothing happens.
>>>>>
>>>>> Srujan Kotikela
>>>>> FireHost - SECURE CLOUD HOSTING
>>>>> North America | Europe | Asia Pacific
>>>>>
>>>>> ComputerWorld: 100 Best Places to Work in IT See Current Opportunities
>>>>>
>>>>> This email and any files transmitted with it are confidential and 
>>>>> intended solely
>>>>> for the use of the individual(s) to whom they are addressed. Do not 
>>>>> disseminate,
>>>>> distribute or copy this e-mail without explicit permission to do so. 
>>>>> Thank you.
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mark Payne [mailto:[email protected]]
>>>>> Sent: Thursday, June 18, 2015 1:22 PM
>>>>> To: [email protected]
>>>>> Subject: RE: Extracting text using RegEx
>>>>>
>>>>> Srujan,
>>>>>
>>>>> When you pull the file via GetHTTP, it assigns a filename to the file. 
>>>>> You can easily change the filename by using an UpdateAttribute Processor. 
>>>>> Just add a new property with the name "filename" and whatever value you 
>>>>> would like. Then, you can write both to the same directory.
>>>>>
>>>>> With ExtractText, it will route the FlowFile to 'matched' or 'unmatched' 
>>>>> depending on whether or not any regex that you provided matches. However, 
>>>>> if the regex has a capturing group, the text that is extracted will be 
>>>>> just what is captured by that group. For example, if your regex is 
>>>>> ".*good-(bye).*" then it will route any FlowFile containing "good-bye"
>>>>> to 'matched' but will extract only the text "bye" because that is what is 
>>>>> in the capturing group.
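The capturing-group behaviour can be tried outside NiFi. This small sketch uses Python's re module rather than the Java regex engine that NiFi runs on, on the assumption that the two behave the same for this simple pattern. The extract function and the relationship strings it returns are stand-ins for illustration, not the processor's API.

```python
import re

PATTERN = re.compile(r".*good-(bye).*")


def extract(content):
    """Return (relationship, captured_text) for a piece of content."""
    match = PATTERN.search(content)
    if match is None:
        return "unmatched", None
    # group(1) is only the text inside the parentheses, not the whole match.
    return "matched", match.group(1)
```

Here extract("well, good-bye then") gives ("matched", "bye"), while content without "good-bye" gives ("unmatched", None).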
>>>>>
>>>>> Once you have extracted the text, though, it is added to a FlowFile 
>>>>> attribute, not the content. So you will want to use a ReplaceText to 
>>>>> replace the content of the FlowFile before you use PutFile.
>>>>>
>>>>> Does this make sense? If not, please let me know where I can help 
>>>>> clarify, and I'll be happy to do so!
>>>>>
>>>>> Thanks
>>>>> -Mark
>>>>>
>>>>> ----------------------------------------
>>>>>> From: [email protected]
>>>>>> To: [email protected]
>>>>>> Subject: RE: Extracting text using RegEx
>>>>>> Date: Thu, 18 Jun 2015 18:08:58 +0000
>>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> I am trying to extract some text from a remote file/feed, downloaded via 
>>>>>> HTTP. The flow I am contemplating is like this:
>>>>>>
>>>>>> GetHTTP ====> ExtractText == (matched) ==> PutFile
>>>>>> ||
>>>>>> (unmatched)
>>>>>> ||
>>>>>> V
>>>>>> PutFile
>>>>>>
>>>>>> I am able to create this flow just fine. However, I have the following 
>>>>>> issues:
>>>>>>
>>>>>> 1. I noticed that the 'file' configured for the GetHTTP processor goes 
>>>>>> into the 'directory' configured in the 'PutFile' processor. This is 
>>>>>> leading me to save the matched file and unmatched file in separate 
>>>>>> directories. Is there a way to have those 2 files in the same directory?
>>>>>>
>>>>>> 2. I don't seem to get the RegEx working. The ExtractText processor 
>>>>>> either matches all input or no input. Are there any particular 
>>>>>> guidelines on how to write regex for NiFi?
>>>>>>
>>>>>> Thanks,
>>>>>> Srujan Kotikela
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mark Payne [mailto:[email protected]]
>>>>>> Sent: Tuesday, June 16, 2015 7:11 PM
>>>>>> To: [email protected]
>>>>>> Subject: RE: Extracting text using RegEx
>>>>>>
>>>>>> Srujan,
>>>>>>
>>>>>> I'm not sure how familiar you are with NiFi, so just a very quick note 
>>>>>> about terminology to make sure you understand what I'm describing. A 
>>>>>> FlowFile is the basic data record in NiFi. It consists of two parts:
>>>>>> - FlowFile Attributes (Key/Value Pairs that are strings)
>>>>>> - FlowFile Content (arbitrary stream of bytes)
>>>>>>
>>>>>> I think the flow that you would want would look like this:
>>>>>>
>>>>>> GetHTTP -> ExtractText -> ReplaceText -> PutFile
>>>>>>
>>>>>> ExtractText will then evaluate the regex against the content pulled from 
>>>>>> the HTTP service and put the result in a FlowFile Attribute. So let's 
>>>>>> say you add a property named "desired.text" with a value 
>>>>>> "<body>(.*)</body>". This will create an Attribute named "desired.text" 
>>>>>> and the value of that attribute will be whatever is found between the 
>>>>>> <body> and </body> tags.
>>>>>>
>>>>>> We will then use ReplaceText with the following configuration:
>>>>>> Regular Expression: .+
>>>>>> Replacement Value: ${desired.text}
>>>>>> All other properties: defaults.
>>>>>>
>>>>>> So what this is doing is replacing the content of the FlowFile with the 
>>>>>> "desired.text" attribute.
>>>>>>
>>>>>> PutFile then writes the file to disk.
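To make the attribute-versus-content distinction concrete, here is a minimal in-memory model of the ExtractText and ReplaceText steps. It is only a sketch. The FlowFile class and both functions are hypothetical stand-ins, Python's re module is used in place of NiFi's Java regex, and the replace step is modelled as a plain overwrite of the whole content rather than an actual ".+" substitution.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy model: a byte payload plus string key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


def extract_text(flowfile, name, pattern):
    """Store capture group 1 in attribute `name`; return the relationship taken."""
    match = re.search(pattern, flowfile.content.decode("utf-8"))
    if match is None:
        return "unmatched"
    flowfile.attributes[name] = match.group(1)
    return "matched"


def replace_text(flowfile, name):
    """Overwrite the content with the value of attribute `name`."""
    flowfile.content = flowfile.attributes[name].encode("utf-8")
```

After extract_text, the content is unchanged and only the attribute is set. The content changes only once replace_text runs, which is why the replace step has to come before the file is written.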
>>>>>>
>>>>>> Hope this helps! If this doesn't work out for you for some reason, or if 
>>>>>> you've got more questions (or if I misunderstood what you're wanting to 
>>>>>> do), please don't hesitate to shoot back and let me know!
>>>>>>
>>>>>> Thanks
>>>>>> -Mark
>>>>>>
>>>>>> ________________________________
>>>>>>> From: [email protected]
>>>>>>> To: [email protected]
>>>>>>> CC: [email protected]
>>>>>>> Subject: Extracting text using RegEx
>>>>>>> Date: Tue, 16 Jun 2015 17:56:38 +0000
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I am trying to download a file (using GetHTTP) from a website and
>>>>>>> extract text from it matching a RegEx pattern (using ExtractText).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I am able to download the file using GetHTTP and save it via PutFile.
>>>>>>> I understand that ExtractText processor works only with a FlowFile.
>>>>>>> So I tried generating a flow file from GetHTTP and PutFile
>>>>>>> (separately), but it doesn't seem to work.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Can anyone give me pointers (examples?) on what processors to be used
>>>>>>> to extract text from a file pulled down by GetHTTP and write the
>>>>>>> matched text to a separate file?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Srujan Kotikela
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Firehost - SECURE CLOUD HOSTING
>>>>>>> North America | Europe | Asia Pacific
>>>>>>>
>>>>>>> ComputerWorld: 100 Best Places to Work in IT - See Current
>>>>>>> Opportunities <http://www.firehost.com/careers>
>>>>>>>
>>>>>>> This email and any files transmitted
>>>>>>> with it are confidential and intended solely for the use of the
>>>>>>> individual(s) to whom they are addressed. Do not disseminate,
>>>>>>> distribute or copy this e-mail without explicit permission to do so.
>>>>>>> Thank you.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>> --
>>> Dr. Chase C Cunningham
>>> CTRC (SW) USN Ret.
>>> The Cynja LLC Proprietary Business and Technical Information
>>> CONFIDENTIAL TREATMENT REQUIRED
>>>
>>
>
> --
> Dr. Chase C Cunningham
> CTRC (SW) USN Ret.
> The Cynja LLC Proprietary Business and Technical Information
> CONFIDENTIAL TREATMENT REQUIRED
>
                                          
