Chase, I want to understand the use case better before I try to offer any advice.
So you want to write the FlowFiles to a directory, and then run an external script to process those files, correct?

Then, once the script has run, what does it do with the result? Does it write it to a file, write to standard out, interact directly with the database, etc?

Thanks
-Mark

----------------------------------------
> Date: Mon, 22 Jun 2015 15:06:47 -0500
> From: [email protected]
> To: [email protected]
> Subject: Re: Extracting text using RegEx
>
> so i have nifi pulling in data in .txt format from about 30 different
> sites....that data gets dumped to a directory call feedfiles...then i
> have a script that will parse out the ip's, exe's, domains, etc..so that
> the parsed stuff can be allocated to a database for indexing...
>
> having trouble automating this activity from the nifi standpoint...help
> is appreciated.
>
> On 6/22/15 2:55 PM, Mark Payne wrote:
>> Chase,
>>
>> You could certainly use the ExecuteStreamCommand processor to accomplish
>> that.
>>
>> You can see the usage guide/documentation for that processor at [1]. Give
>> that a look and let me know if it meets your needs or not.
>>
>> Thanks
>> -Mark
>>
>> [1]
>> http://nifi.incubator.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExecuteStreamCommand/index.html
>>
>> ----------------------------------------
>>> Date: Mon, 22 Jun 2015 14:21:00 -0500
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: Re: Extracting text using RegEx
>>>
>>> how can one run a script within NIFI to accomplish parsing?
>>>
>>> On 6/22/15 12:41 PM, Mark Payne wrote:
>>>> Srujan,
>>>>
>>>> My guess is that the issue you are seeing is due to the GetHTTP caching
>>>> the ETag/LastModified value. When the processor receives the response
>>>> for an HTTP GET request, it writes the ETag to
>>>> conf/.httpCache-<processor id>.
>>>>
>>>> It does this so that even after a restart of nifi, we don't keep pulling
>>>> the same content.
>>>> If the content changes at any point, it will pull the new version of the
>>>> content, though.
>>>>
>>>> You could trigger it to pull data either by copying and pasting the
>>>> GetHTTP Processor and letting the new processor pull the data, or you
>>>> could delete that file from the conf/ directory and restart.
>>>>
>>>> If this doesn't give you what you need, please feel free to let me know!
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>> ----------------------------------------
>>>>> From: [email protected]
>>>>> To: [email protected]
>>>>> Subject: RE: Extracting text using RegEx
>>>>> Date: Mon, 22 Jun 2015 15:11:18 +0000
>>>>>
>>>>> Mark,
>>>>>
>>>>> How can I rerun the processors after changing some of the attributes? For
>>>>> example, when I change the Regex pattern and start the processors,
>>>>> nothing happens.
>>>>>
>>>>> Srujan Kotikela
>>>>> FireHost - SECURE CLOUD HOSTING
>>>>> North America | Europe | Asia Pacific
>>>>>
>>>>> ComputerWorld: 100 Best Places to Work in IT See Current Opportunities
>>>>>
>>>>> This email and any files transmitted with it are confidential and
>>>>> intended solely for the use of the individual(s) to whom they are
>>>>> addressed. Do not disseminate, distribute or copy this e-mail without
>>>>> explicit permission to do so. Thank you.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mark Payne [mailto:[email protected]]
>>>>> Sent: Thursday, June 18, 2015 1:22 PM
>>>>> To: [email protected]
>>>>> Subject: RE: Extracting text using RegEx
>>>>>
>>>>> Srujan,
>>>>>
>>>>> When you pull the file via GetHTTP, it assigns a filename to the file.
>>>>> You can easily change the filename by using an UpdateAttribute Processor.
>>>>> Just add a new property with the name "filename" and whatever value you
>>>>> would like. Then, you can write both to the same directory.
>>>>>
>>>>> With ExtractText, it will route the FlowFile to 'matched' or 'unmatched'
>>>>> depending on whether or not any regex that you provided matches. However,
>>>>> if the regex has a capturing group, the text that is extracted will be
>>>>> just what is captured by that group. For example, if your regex is
>>>>> ".*good-(bye).*" then it will route any FlowFile containing "good-bye"
>>>>> to 'matched' but will extract only the text "bye" because that is what is
>>>>> in the capturing group.
>>>>>
>>>>> Once you have extracted the text, though, it is added to a FlowFile
>>>>> attribute, not the content. So you will want to use a ReplaceText to
>>>>> replace the content of the FlowFile before you use PutFile.
>>>>>
>>>>> Does this make sense? If not, please let me know where I can help
>>>>> clarify, and I'll be happy to do so!
>>>>>
>>>>> Thanks
>>>>> -Mark
>>>>>
>>>>> ----------------------------------------
>>>>>> From: [email protected]
>>>>>> To: [email protected]
>>>>>> Subject: RE: Extracting text using RegEx
>>>>>> Date: Thu, 18 Jun 2015 18:08:58 +0000
>>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> I am trying to extract some text from a remote file/feed, downloaded via
>>>>>> HTTP. The flow I am contemplating is like this:
>>>>>>
>>>>>> GetHTTP ====> ExtractText == (matched) ==> PutFile
>>>>>>                   ||
>>>>>>              (unmatched)
>>>>>>                   ||
>>>>>>                   V
>>>>>>                PutFile
>>>>>>
>>>>>> I am able to create this flow just fine. However, I have following
>>>>>> issues:
>>>>>>
>>>>>> 1. I noticed that the 'file' configured for the GetHTTP processor goes
>>>>>> into the 'directory' configured in the 'PutFile' processor. This is
>>>>>> leading me to save the matched file and unmatched file in separate
>>>>>> directories. Is there way to have those 2 files in the same directory?
>>>>>>
>>>>>> 2. I don't seem to get the RegEx working. The ExtractText processor
>>>>>> either matches all input or no input. Are there any particular
>>>>>> guidelines on how to write regex for NiFi?
>>>>>>
>>>>>> Thanks,
>>>>>> Srujan Kotikela
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mark Payne [mailto:[email protected]]
>>>>>> Sent: Tuesday, June 16, 2015 7:11 PM
>>>>>> To: [email protected]
>>>>>> Subject: RE: Extracting text using RegEx
>>>>>>
>>>>>> Srujan,
>>>>>>
>>>>>> I'm not sure how familiar you are with NiFi, so just a very quick note
>>>>>> about terminology to make sure you understand what I'm describing. A
>>>>>> FlowFile is the basic data record in NiFi. It consists of two parts:
>>>>>> - FlowFile Attributes (Key/Value Pairs that are strings)
>>>>>> - FlowFile Content (arbitrary stream of bytes)
>>>>>>
>>>>>> I think the flow that you would want would look like this:
>>>>>>
>>>>>> GetHTTP -> ExtractText -> ReplaceText -> PutFile
>>>>>>
>>>>>> ExtractText will then evaluate the regex against the content pulled from
>>>>>> the HTTP service and put the result in a FlowFile Attribute. So let's
>>>>>> say you add a property named "desired.text" with a value
>>>>>> "<body>(.*)</body>". This will create an Attribute named "desired.text"
>>>>>> and the value of that attribute will be whatever is found between the
>>>>>> <body> and </body> tags.
>>>>>>
>>>>>> We will then use ReplaceText with the following configuration:
>>>>>>   Regular Expression: .+
>>>>>>   Replacement Value: ${desired.text}
>>>>>>   All other properties: defaults.
>>>>>>
>>>>>> So what this is doing is replacing the content of the FlowFile with the
>>>>>> "desired.text" attribute.
>>>>>>
>>>>>> PutFile then writes the file to disk.
>>>>>>
>>>>>> Hope this helps! If this doesn't work out for you for some reason, or if
>>>>>> you've got more questions (or if I misunderstood what you're wanting to
>>>>>> do), please don't hesitate to shoot back and let me know!
>>>>>>
>>>>>> Thanks
>>>>>> -Mark
>>>>>>
>>>>>> ________________________________
>>>>>>> From: [email protected]
>>>>>>> To: [email protected]
>>>>>>> CC: [email protected]
>>>>>>> Subject: Extracting text using RegEx
>>>>>>> Date: Tue, 16 Jun 2015 17:56:38 +0000
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to download a file (using GetHTTP) from a website and
>>>>>>> extract text from it matching a RegEx pattern (using ExtractText).
>>>>>>>
>>>>>>> I am able to download the file using GetHTTP and save it via PutFile.
>>>>>>> I understand that ExtractText processor works only with a FlowFile.
>>>>>>> So I tried generating a flow file from GetHTTP and PutFile
>>>>>>> (separately), but it doesn't seem to work.
>>>>>>>
>>>>>>> Can anyone give me pointers (examples?) on what processors to be used
>>>>>>> to extract text from a file pulled down by GetHTTP and write the
>>>>>>> matched text to a separate file?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Srujan Kotikela
>>>>>>>
>>> --
>>> Dr. Chase C Cunningham
>>> CTRC (SW) USN Ret.
>>> The Cynja LLC Proprietary Business and Technical Information
>>> CONFIDENTIAL TREATMENT REQUIRED
>>>
>>
>
> --
> Dr. Chase C Cunningham
> CTRC (SW) USN Ret.
> The Cynja LLC Proprietary Business and Technical Information
> CONFIDENTIAL TREATMENT REQUIRED
>
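
For anyone following the thread, here is a rough sketch of the ExtractText -> ReplaceText behaviour described in the quoted messages, written as plain Python outside NiFi. It is only an illustration: extract_text and replace_text are hypothetical helper names, not NiFi APIs, and NiFi evaluates Java regular expressions, which happen to agree with Python's re module for these two simple patterns. The sample content is a single line, so ".+" covers all of it.

```python
import re


def extract_text(content, properties):
    """Loosely model ExtractText: return (relationship, attributes).

    `properties` maps an attribute name to a regex, as in the thread's
    "desired.text" -> "<body>(.*)</body>" example. Hypothetical helper.
    """
    attributes = {}
    for name, pattern in properties.items():
        match = re.search(pattern, content)
        if match:
            # Keep only the first capturing group when the regex has one,
            # otherwise keep the whole match.
            attributes[name] = match.group(1) if match.groups() else match.group(0)
    relationship = "matched" if attributes else "unmatched"
    return relationship, attributes


def replace_text(content, pattern, replacement):
    """Loosely model ReplaceText: swap whatever `pattern` matches for `replacement`."""
    # A function replacement keeps backslashes in the attribute value literal.
    return re.sub(pattern, lambda _match: replacement, content)


# Mark's first example: pull out what sits between <body> and </body>,
# then overwrite the content with that attribute before writing it out.
content = "<html><body>hello world</body></html>"
relationship, attrs = extract_text(content, {"desired.text": "<body>(.*)</body>"})
new_content = replace_text(content, ".+", attrs["desired.text"])

# Mark's second example: the regex matches the whole line, but only the
# capturing group ("bye") ends up in the attribute.
rel2, attrs2 = extract_text("we said good-bye", {"word": ".*good-(bye).*"})
```

The point the sketch tries to make is the one from the thread: the regex decides matched vs. unmatched, the capturing group decides what lands in the attribute, and the content itself is unchanged until a separate replace step overwrites it.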
