1. NiFi pulls the text files over HTTP
2. the files are put in a directory in .txt format
3. a script runs to parse the files, pulling out each data point of value
4. the parsed data is written to files associated with the data points inside
5. the data is sent to a data repo for future indexing and use
On 6/22/15 3:22 PM, Mark Payne wrote:
Chase,
I want to understand the use case better before I try to offer any advice.
So you want to write the FlowFiles to a directory, and then run an external
script to process those files, correct?
Then, once the script has run, what does it do with the result? Does it write
it to a file, write to standard out,
interact directly with the database, etc?
Thanks
-Mark
----------------------------------------
Date: Mon, 22 Jun 2015 15:06:47 -0500
From: [email protected]
To: [email protected]
Subject: Re: Extracting text using RegEx
so i have nifi pulling in data in .txt format from about 30 different
sites....that data gets dumped to a directory called feedfiles...then i
have a script that will parse out the ip's, exe's, domains, etc..so that
the parsed stuff can be allocated to a database for indexing...
having trouble automating this activity from the nifi standpoint...help
is appreciated.
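For illustration, a parsing script of the kind described above might look roughly like the sketch below. The actual script is not shown in this thread, so the regular expressions, the function names (`parse_indicators`, `parse_directory`) and the short list of top-level domains are all assumptions made for the example, and real feeds would likely need stricter validation.

```python
import re
from pathlib import Path

# Illustrative patterns only (assumptions, not the script discussed in the thread).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EXE = re.compile(r"\b[\w.-]+\.exe\b", re.IGNORECASE)
DOMAIN = re.compile(
    r"\b(?:[a-z0-9-]+\.)+(?:com|net|org|info|biz|ru|cn)\b", re.IGNORECASE
)


def parse_indicators(text):
    """Return the sorted, de-duplicated IPs, executables and domains found in text."""
    return {
        "ips": sorted(set(IPV4.findall(text))),
        "exes": sorted(set(EXE.findall(text))),
        "domains": sorted(set(DOMAIN.findall(text))),
    }


def parse_directory(directory):
    """Parse every .txt file in a directory; results are keyed by file name."""
    return {
        path.name: parse_indicators(path.read_text(errors="replace"))
        for path in sorted(Path(directory).glob("*.txt"))
    }
```

Calling `parse_directory("feedfiles")` would then return one dictionary of indicators per downloaded feed file, ready to be written out or loaded into a database.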
On 6/22/15 2:55 PM, Mark Payne wrote:
Chase,
You could certainly use the ExecuteStreamCommand processor to accomplish that.
You can see the usage guide/documentation for that processor at [1]. Give that
a look and let me know if it meets your needs or not.
Thanks
-Mark
[1]
http://nifi.incubator.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExecuteStreamCommand/index.html
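As a rough sketch of a script that could sit behind ExecuteStreamCommand: the example assumes the processor feeds the FlowFile content to the command's standard input and takes whatever the command writes to standard output as the result, which is worth confirming against the usage guide at [1]. The function names and the IPv4 pattern are hypothetical.

```python
import re
import sys

# Illustrative pattern (an assumption); it does not validate octet ranges.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ips(stream_in, stream_out):
    """Write each distinct IPv4-looking token from stream_in to stream_out, one per line."""
    seen = set()
    for line in stream_in:
        for ip in IPV4.findall(line):
            if ip not in seen:
                seen.add(ip)
                stream_out.write(ip + "\n")


def main():
    """Entry point for use as an external command: stdin in, stdout out."""
    extract_ips(sys.stdin, sys.stdout)
```

Saved as a script with a call to `main()` at the bottom, this could be tried outside NiFi first, for example by piping one of the downloaded .txt files into it.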
----------------------------------------
Date: Mon, 22 Jun 2015 14:21:00 -0500
From: [email protected]
To: [email protected]
Subject: Re: Extracting text using RegEx
how can one run a script within NIFI to accomplish parsing?
On 6/22/15 12:41 PM, Mark Payne wrote:
Srujan,
My guess is that the issue you are seeing is due to the GetHTTP processor caching
the ETag/LastModified value. When the processor receives the response for an HTTP
GET request, it writes the ETag to conf/.httpCache-<processor id>. It does this so
that even after a restart of NiFi, we don't keep pulling the same content. If the
content changes at any point, it will pull the new version of the content, though.
You could trigger it to pull data either by copying and pasting the GetHTTP
Processor and letting the new processor pull the data, or you could delete that
file from the conf/ directory and restart.
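The general idea of ETag-based caching can be sketched as follows. This is not NiFi's implementation: `conditional_get`, the in-memory `cache` dictionary and the `fake_server` stand-in are all hypothetical, and the example only shows why a cached ETag stops unchanged content from being pulled again.

```python
def conditional_get(fetch, cache, url):
    """Fetch url, returning the body, or None if the server reports no change.

    fetch(url, headers) must return a (status, response_headers, body) tuple.
    """
    headers = {}
    if url in cache:
        # Send back the ETag we saw last time.
        headers["If-None-Match"] = cache[url]
    status, response_headers, body = fetch(url, headers)
    if status == 304:
        # Not Modified: nothing new to pull.
        return None
    etag = response_headers.get("ETag")
    if etag is not None:
        cache[url] = etag
    return body


def fake_server(url, headers):
    """Stand-in for an HTTP server whose content currently has ETag "v1"."""
    if headers.get("If-None-Match") == '"v1"':
        return 304, {}, b""
    return 200, {"ETag": '"v1"'}, b"feed contents"
```

With an empty `cache`, the first call returns the body and the second returns `None`. Emptying the cache plays the same role as deleting the cache file from conf/: the next request goes out without the stored ETag and the content is pulled again.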
If this doesn't give you what you need, please feel free to let me know!
Thanks
-Mark
----------------------------------------
From: [email protected]
To: [email protected]
Subject: RE: Extracting text using RegEx
Date: Mon, 22 Jun 2015 15:11:18 +0000
Mark,
How can I rerun the processors after changing some of the attributes? For
example, when I change the Regex pattern and start the processors, nothing
happens.
Srujan Kotikela
FireHost - SECURE CLOUD HOSTING
North America | Europe | Asia Pacific
ComputerWorld: 100 Best Places to Work in IT See Current Opportunities
This email and any files transmitted with it are confidential and intended
solely
for the use of the individual(s) to whom they are addressed. Do not disseminate,
distribute or copy this e-mail without explicit permission to do so. Thank you.
-----Original Message-----
From: Mark Payne [mailto:[email protected]]
Sent: Thursday, June 18, 2015 1:22 PM
To: [email protected]
Subject: RE: Extracting text using RegEx
Srujan,
When you pull the file via GetHTTP, it assigns a filename to the file. You can easily
change the filename by using an UpdateAttribute Processor. Just add a new property with
the name "filename" and whatever value you would like. Then, you can write both
to the same directory.
With ExtractText, it will route the FlowFile to 'matched' or 'unmatched' depending on whether or
not any regex that you provided matches. However, if the regex has a capturing group, the text that
is extracted will be just what is captured by that group. For example, if your regex is
".*good-(bye).*" then it will route any FlowFile containing "good-bye"
to 'matched' but will extract only the text "bye" because that is what is in
the capturing group.
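The capturing-group behaviour can be tried outside NiFi. The sketch below uses Python's `re` module as a stand-in (NiFi itself is Java-based, but this particular pattern behaves the same way in both); the `extract` function and its return values are hypothetical and only mimic the 'matched'/'unmatched' routing described above.

```python
import re

# The example regex from the message above.
PATTERN = re.compile(r".*good-(bye).*")


def extract(content):
    """Return (relationship, captured_text) for the given content."""
    match = PATTERN.search(content)
    if match is None:
        return "unmatched", None
    # Only the text inside the capturing group is extracted.
    return "matched", match.group(1)
```

Here `extract("well, good-bye then")` gives `("matched", "bye")`, while content without "good-bye" gives `("unmatched", None)`.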
Once you have extracted the text, though, it is added to a FlowFile attribute,
not the content. So you will want to use a ReplaceText to replace the content
of the FlowFile before you use PutFile.
Does this make sense? If not, please let me know where I can help clarify, and
I'll be happy to do so!
Thanks
-Mark
----------------------------------------
From: [email protected]
To: [email protected]
Subject: RE: Extracting text using RegEx
Date: Thu, 18 Jun 2015 18:08:58 +0000
Hi Mark,
I am trying to extract some text from a remote file/feed, downloaded via HTTP.
The flow I am contemplating is like this:
GetHTTP ====> ExtractText == (matched) ==> PutFile
||
(unmatched)
||
V
PutFile
I am able to create this flow just fine. However, I have following issues:
1. I noticed that the 'file' configured for the GetHTTP processor goes into the
'directory' configured in the 'PutFile' processor. This is leading me to save
the matched file and unmatched file in separate directories. Is there way to
have those 2 files in the same directory?
2. I don't seem to get the RegEx working. The ExtractText processor either
matches all input or no input. Are there any particular guidelines on how to
write regex for NiFi?
Thanks,
Srujan Kotikela
-----Original Message-----
From: Mark Payne [mailto:[email protected]]
Sent: Tuesday, June 16, 2015 7:11 PM
To: [email protected]
Subject: RE: Extracting text using RegEx
Srujan,
I'm not sure how familiar you are with NiFi, so just a very quick note about
terminology to make sure you understand what I'm describing. A FlowFile is the
basic data record in NiFi. It consists of two parts:
- FlowFile Attributes (Key/Value Pairs that are strings)
- FlowFile Content (arbitrary stream of bytes)
I think the flow that you would want would look like this:
GetHTTP -> ExtractText -> ReplaceText -> PutFile
ExtractText will then evaluate the regex against the content pulled from the HTTP service and put the result in a FlowFile Attribute.
So let's say you add a property named "desired.text" with a value "<body>(.*)</body>". This will
create an Attribute named "desired.text" and the value of that attribute will be whatever is found between the <body>
and </body> tags.
We will then use ReplaceText with the following configuration:
Regular Expression: .+
Replacement Value: ${desired.text}
All other properties: defaults.
So what this is doing is replacing the content of the FlowFile with the
"desired.text" attribute.
PutFile then writes the file to disk.
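To make the attribute-versus-content distinction concrete, here is a small simulation of the ExtractText and ReplaceText steps. It is only a model under stated assumptions: the `FlowFile` class, `extract_text` and `replace_text` are hypothetical names, content is held as a string rather than a byte stream, and Python's `re` stands in for the regex engine NiFi uses.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy model of a FlowFile: content plus string key/value attributes."""
    content: str
    attributes: dict = field(default_factory=dict)


def extract_text(flowfile, name, regex):
    """Store the first capturing group of regex in attribute `name`; return True on a match."""
    match = re.search(regex, flowfile.content, re.DOTALL)
    if match:
        flowfile.attributes[name] = match.group(1)
    return match is not None


def replace_text(flowfile, regex, attribute):
    """Replace the part of the content matching regex with the value of an attribute."""
    value = flowfile.attributes[attribute]
    # A function replacement avoids backslash escapes in `value` being interpreted.
    flowfile.content = re.sub(
        regex, lambda m: value, flowfile.content, count=1, flags=re.DOTALL
    )
```

Starting from `FlowFile("<html><body>hello world</body></html>")`, calling `extract_text(ff, "desired.text", "<body>(.*)</body>")` followed by `replace_text(ff, ".+", "desired.text")` leaves `ff.content` equal to `hello world`, which is what PutFile would then write to disk.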
Hope this helps! If this doesn't work out for you for some reason, or if you've
got more questions (or if I misunderstood what you're wanting to do), please
don't hesitate to shoot back and let me know!
Thanks
-Mark
________________________________
From: [email protected]
To: [email protected]
CC: [email protected]
Subject: Extracting text using RegEx
Date: Tue, 16 Jun 2015 17:56:38 +0000
Hi,
I am trying to download a file (using GetHTTP) from a website and
extract text from it matching a RegEx pattern (using ExtractText).
I am able to download the file using GetHTTP and save it via PutFile.
I understand that ExtractText processor works only with a FlowFile.
So I tried generating a flow file from GetHTTP and PutFile
(separately), but it doesn't seem to work.
Can anyone give me pointers (examples?) on what processors to be used
to extract text from a file pulled down by GetHTTP and write the
matched text to a separate file?
Thanks,
Srujan Kotikela
--
Dr. Chase C Cunningham
CTRC (SW) USN Ret.
The Cynja LLC Proprietary Business and Technical Information
CONFIDENTIAL TREATMENT REQUIRED