Thanks for the clarification.
The ExecuteStreamCommand processor that I was suggesting expects that the data
can be streamed directly to the script that it is running. The next version of
NiFi (0.2.0-incubating) provides the ability to avoid streaming data to
Standard In. This change is available today if you are building from the
codebase. If you are just downloading the newest build, it is likely a couple
of weeks away from being delivered.
With that change, you can use PutFile -> ExecuteStreamCommand so that you write
the file to disk, and then
use ExecuteStreamCommand to call the script that parses the data. You can then
use the ${filename} as one
of the parameters to the script in order to tell it which file to run against.
From there, you can use GetFile to pick up
the result, if you want to bring it back into your NiFi flow, or you can
process it however makes sense outside
of NiFi.
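
As a rough sketch of the script side of that flow, something like the
following could work. It assumes the script is written in Python, that it
receives the full path of the file as its single argument (the directory
PutFile writes to plus the filename), and that it only needs to pull out
IPv4 addresses. The name parse_feed.py, the regex, and the output format are
all hypothetical and stand in for whatever your real parser does.

```python
import re
import sys

# Hypothetical pattern: matches anything that looks like a dotted-quad IPv4
# address. A real feed parser would likely need stricter validation.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ips(text):
    """Return the IPv4-looking strings in text, de-duplicated, in order of first appearance."""
    seen = []
    for match in IPV4_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def main(argv):
    """Read the file named by argv[1] and print one extracted address per line."""
    with open(argv[1], encoding="utf-8", errors="replace") as handle:
        for ip in extract_ips(handle.read()):
            print(ip)
    return 0


# Only run when a file path was actually passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Run by hand, this would be invoked as "python parse_feed.py <path-to-file>".
Because it prints to standard out, the results could also be redirected to a
file for GetFile to pick up.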
Until that change is available, it may be a little more difficult, as the
processor wants to stream the content of
the FlowFile directly to the script.
A possible workaround in the meantime would be to use PutFile -> ReplaceText ->
ExecuteStreamCommand and
configure ReplaceText to replace the regex ".*" with an empty value. In that
case, it won't stream any data
to the script, and you can just invoke the script using the filename as a
parameter.
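
One thing worth checking with this workaround: in most regex engines "." does
not match line breaks by default, so ".*" can leave newline characters behind
when the content spans several lines. The Python snippet below only
illustrates that general regex behavior. It is not NiFi code, and whether
ReplaceText needs something like "(?s).*" to clear the content completely is
an assumption to verify against the processor itself.

```python
import re


def blank_content(text, dotall=False):
    """Replace every match of '.*' in text with an empty string."""
    # With DOTALL, '.' also matches newlines, so one match covers everything.
    flags = re.DOTALL if dotall else 0
    return re.sub(r".*", "", text, flags=flags)


sample = "line one\nline two\n"
print(repr(blank_content(sample)))               # the newlines survive
print(repr(blank_content(sample, dotall=True)))  # everything is removed
```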
Does this help at all?
Thanks
-Mark
----------------------------------------
> Date: Mon, 22 Jun 2015 15:28:17 -0500
> From: [email protected]
> To: [email protected]
> Subject: Re: Extracting text using RegEx
>
> 1. nifi does http stuff to get text files
> 2. files are put in directory in .txt format
> 3. script runs to parse through files, each data point of value is parsed
> 4. parsed data is written to files associated with data points inside
> 5. data is sent to data repo for future indexing and use
>
>
>
> On 6/22/15 3:22 PM, Mark Payne wrote:
>> Chase,
>>
>> I want to understand the use case better before I try to offer any advice.
>>
>> So you want to write the FlowFiles to a directory, and then run an external
>> script to process those files, correct?
>> Then, once the script has run, what does it do with the result? Does it
>> write it to a file, write to standard out,
>> interact directly with the database, etc?
>>
>> Thanks
>> -Mark
>>
>> ----------------------------------------
>>> Date: Mon, 22 Jun 2015 15:06:47 -0500
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: Re: Extracting text using RegEx
>>>
>>> so i have nifi pulling in data in .txt format from about 30 different
>>> sites....that data gets dumped to a directory called feedfiles...then i
>>> have a script that will parse out the ip's, exe's, domains, etc..so that
>>> the parsed stuff can be allocated to a database for indexing...
>>>
>>> having trouble automating this activity from the nifi standpoint...help
>>> is appreciated.
>>>
>>> On 6/22/15 2:55 PM, Mark Payne wrote:
>>>> Chase,
>>>>
>>>> You could certainly use the ExecuteStreamCommand processor to accomplish
>>>> that.
>>>>
>>>> You can see the usage guide/documentation for that processor at [1]. Give
>>>> that a look and
>>>> let me know if it meets your needs or not.
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>> [1]
>>>> http://nifi.incubator.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExecuteStreamCommand/index.html
>>>>
>>>>
>>>> ----------------------------------------
>>>>> Date: Mon, 22 Jun 2015 14:21:00 -0500
>>>>> From: [email protected]
>>>>> To: [email protected]
>>>>> Subject: Re: Extracting text using RegEx
>>>>>
>>>>> how can one run a script within NIFI to accomplish parsing?
>>>>>
>>>>> On 6/22/15 12:41 PM, Mark Payne wrote:
>>>>>> Srujan,
>>>>>>
>>>>>> My guess is that the issue you are seeing is due to the GetHTTP caching
>>>>>> the ETag/LastModified value. When the
>>>>>> processor receives the response for an HTTP GET request, it writes the
>>>>>> ETag to conf/.httpCache-<processor id>.
>>>>>>
>>>>>> It does this so that even after a restart of nifi, we don't keep pulling
>>>>>> the same content. If the content changes at any
>>>>>> point, it will pull the new version of the content, though.
>>>>>>
>>>>>> You could trigger it to pull data either by copying and pasting the
>>>>>> GetHTTP Processor and letting the new processor
>>>>>> pull the data, or you could delete that file from the conf/ directory
>>>>>> and restart.
>>>>>>
>>>>>> If this doesn't give you what you need, please feel free to let me know!
>>>>>>
>>>>>> Thanks
>>>>>> -Mark
>>>>>>
>>>>>> ----------------------------------------
>>>>>>> From: [email protected]
>>>>>>> To: [email protected]
>>>>>>> Subject: RE: Extracting text using RegEx
>>>>>>> Date: Mon, 22 Jun 2015 15:11:18 +0000
>>>>>>>
>>>>>>> Mark,
>>>>>>>
>>>>>>> How can I rerun the processors after changing some of the attributes?
>>>>>>> For example, when I change the Regex pattern and start the processors,
>>>>>>> nothing happens.
>>>>>>>
>>>>>>> Srujan Kotikela
>>>>>>> FireHost - SECURE CLOUD HOSTING
>>>>>>> North America | Europe | Asia Pacific
>>>>>>>
>>>>>>> ComputerWorld: 100 Best Places to Work in IT See Current Opportunities
>>>>>>>
>>>>>>> This email and any files transmitted with it are confidential and
>>>>>>> intended solely
>>>>>>> for the use of the individual(s) to whom they are addressed. Do not
>>>>>>> disseminate,
>>>>>>> distribute or copy this e-mail without explicit permission to do so.
>>>>>>> Thank you.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Mark Payne [mailto:[email protected]]
>>>>>>> Sent: Thursday, June 18, 2015 1:22 PM
>>>>>>> To: [email protected]
>>>>>>> Subject: RE: Extracting text using RegEx
>>>>>>>
>>>>>>> Srujan,
>>>>>>>
>>>>>>> When you pull the file via GetHTTP, it assigns a filename to the file.
>>>>>>> You can easily change the filename by using an UpdateAttribute
>>>>>>> Processor. Just add a new property with the name "filename" and
>>>>>>> whatever value you would like. Then, you can write both to the same
>>>>>>> directory.
>>>>>>>
>>>>>>> With ExtractText, it will route the FlowFile to 'matched' or
>>>>>>> 'unmatched' depending on whether or not any regex that you provided
>>>>>>> matches. However, if the regex has a capturing group, the text that is
>>>>>>> extracted will be just what is captured by that group. For example, if
>>>>>>> your regex is ".*good-(bye).*" then it will route any FlowFile
>>>>>>> containing "good-bye"
>>>>>>> to 'matched' but will extract only the text "bye" because that is what
>>>>>>> is in the capturing group.
>>>>>>>
>>>>>>> Once you have extracted the text, though, it is added to a FlowFile
>>>>>>> attribute, not the content. So you will want to use a ReplaceText to
>>>>>>> replace the content of the FlowFile before you use PutFile.
>>>>>>>
>>>>>>> Does this make sense? If not, please let me know where I can help
>>>>>>> clarify, and I'll be happy to do so!
>>>>>>>
>>>>>>> Thanks
>>>>>>> -Mark
>>>>>>>
>>>>>>> ----------------------------------------
>>>>>>>> From: [email protected]
>>>>>>>> To: [email protected]
>>>>>>>> Subject: RE: Extracting text using RegEx
>>>>>>>> Date: Thu, 18 Jun 2015 18:08:58 +0000
>>>>>>>>
>>>>>>>> Hi Mark,
>>>>>>>>
>>>>>>>> I am trying to extract some text from a remote file/feed, downloaded
>>>>>>>> via HTTP. The flow I am contemplating is like this:
>>>>>>>>
>>>>>>>> GetHTTP ====> ExtractText == (matched) ==> PutFile
>>>>>>>> ||
>>>>>>>> (unmatched)
>>>>>>>> ||
>>>>>>>> V
>>>>>>>> PutFile
>>>>>>>>
>>>>>>>> I am able to create this flow just fine. However, I have following
>>>>>>>> issues:
>>>>>>>>
>>>>>>>> 1. I noticed that the 'file' configured for the GetHTTP processor goes
>>>>>>>> into the 'directory' configured in the 'PutFile' processor. This is
>>>>>>>> leading me to save the matched file and unmatched file in separate
>>>>>>>> directories. Is there a way to have those 2 files in the same directory?
>>>>>>>>
>>>>>>>> 2. I don't seem to get the RegEx working. The ExtractText processor
>>>>>>>> either matches all input or no input. Are there any particular
>>>>>>>> guidelines on how to write regex for NiFi?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Srujan Kotikela
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Mark Payne [mailto:[email protected]]
>>>>>>>> Sent: Tuesday, June 16, 2015 7:11 PM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: RE: Extracting text using RegEx
>>>>>>>>
>>>>>>>> Srujan,
>>>>>>>>
>>>>>>>> I'm not sure how familiar you are with NiFi, so just a very quick note
>>>>>>>> about terminology to make sure you understand what I'm describing. A
>>>>>>>> FlowFile is the basic data record in NiFi. It consists of two parts:
>>>>>>>> - FlowFile Attributes (Key/Value Pairs that are strings)
>>>>>>>> - FlowFile Content (arbitrary stream of bytes)
>>>>>>>>
>>>>>>>> I think the flow that you would want would look like this:
>>>>>>>>
>>>>>>>> GetHTTP -> ExtractText -> ReplaceText -> PutFile
>>>>>>>>
>>>>>>>> ExtractText will then evaluate the regex against the content pulled
>>>>>>>> from the HTTP service and put the result in a FlowFile Attribute. So
>>>>>>>> let's say you add a property named "desired.text" with a value
>>>>>>>> "<body>(.*)</body>". This will create an Attribute named
>>>>>>>> "desired.text" and the value of that attribute will be whatever is
>>>>>>>> found between the <body> and </body> tags.
>>>>>>>>
>>>>>>>> We will then use ReplaceText with the following configuration:
>>>>>>>> Regular Expression: .+
>>>>>>>> Replacement Value: ${desired.text}
>>>>>>>> All other properties: defaults.
>>>>>>>>
>>>>>>>> So what this is doing is replacing the content of the FlowFile with
>>>>>>>> the "desired.text" attribute.
>>>>>>>>
>>>>>>>> PutFile then writes the file to disk.
>>>>>>>>
>>>>>>>> Hope this helps! If this doesn't work out for you for some reason, or
>>>>>>>> if you've got more questions (or if I misunderstood what you're
>>>>>>>> wanting to do), please don't hesitate to shoot back and let me know!
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -Mark
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>>> From: [email protected]
>>>>>>>>> To: [email protected]
>>>>>>>>> CC: [email protected]
>>>>>>>>> Subject: Extracting text using RegEx
>>>>>>>>> Date: Tue, 16 Jun 2015 17:56:38 +0000
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am trying to download a file (using GetHTTP) from a website and
>>>>>>>>> extract text from it matching a RegEx pattern (using ExtractText).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am able to download the file using GetHTTP and save it via PutFile.
>>>>>>>>> I understand that ExtractText processor works only with a FlowFile.
>>>>>>>>> So I tried generating a flow file from GetHTTP and PutFile
>>>>>>>>> (separately), but it doesn't seem to work.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Can anyone give me pointers (examples?) on what processors to be used
>>>>>>>>> to extract text from a file pulled down by GetHTTP and write the
>>>>>>>>> matched text to a separate file?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Srujan Kotikela
>>>>> --
>>>>> Dr. Chase C Cunningham
>>>>> CTRC (SW) USN Ret.
>>>>> The Cynja LLC Proprietary Business and Technical Information
>>>>> CONFIDENTIAL TREATMENT REQUIRED
>>>>>
>>>
>>
>
>