1.  NiFi uses HTTP to pull down the text files
2.  the files are put in a directory in .txt format
3.  a script runs to parse through the files, and each data point of value is parsed out
4.  the parsed data is written to files associated with the data points inside them
5.  the data is sent to a data repo for future indexing and use



On 6/22/15 3:22 PM, Mark Payne wrote:
Chase,

I want to understand the use case better before I try to offer any advice.

So you want to write the FlowFiles to a directory, and then run an external script to process those files, correct?
Then, once the script has run, what does it do with the result? Does it write it to a file, write to standard out, interact directly with the database, etc.?

Thanks
-Mark

----------------------------------------
Date: Mon, 22 Jun 2015 15:06:47 -0500
From: [email protected]
To: [email protected]
Subject: Re: Extracting text using RegEx

So I have NiFi pulling in data in .txt format from about 30 different
sites. That data gets dumped to a directory called feedfiles. Then I
have a script that will parse out the IPs, EXEs, domains, etc., so that
the parsed stuff can be loaded into a database for indexing.

I'm having trouble automating this activity from the NiFi standpoint. Help
is appreciated.

On 6/22/15 2:55 PM, Mark Payne wrote:
Chase,

You could certainly use the ExecuteStreamCommand processor to accomplish that.

You can see the usage guide/documentation for that processor at [1]. Give that a look and
let me know if it meets your needs or not.
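
As a rough illustration of the kind of script that could sit behind ExecuteStreamCommand: my understanding of the processor documentation is that it streams the FlowFile content to the command's standard input and turns whatever the command writes to standard output into the content of the outgoing FlowFile. The sketch below follows that shape. The pattern set, the function names, and the JSON output format are all assumptions made for the example, not anything from Chase's actual script.

```python
import json
import re
import sys

# Hypothetical patterns for illustration only; real feeds will likely
# need stricter expressions (and a longer list of top-level domains).
PATTERNS = {
    "ips": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "exes": re.compile(r"\b[\w.-]+\.exe\b", re.IGNORECASE),
    "domains": re.compile(r"\b(?:[a-z0-9-]+\.)+(?:com|net|org)\b", re.IGNORECASE),
}


def parse_indicators(text):
    """Return a dict mapping each indicator type to its sorted unique matches."""
    return {name: sorted(set(p.findall(text))) for name, p in PATTERNS.items()}


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Read the whole feed from stdin and write the parsed result to stdout as JSON.
    json.dump(parse_indicators(stdin.read()), stdout, indent=2)
    stdout.write("\n")
```

To use this as a standalone script, call `main()` at the bottom of the file and point the processor's command at the Python interpreter with the script path as its argument.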

Thanks
-Mark

[1] http://nifi.incubator.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExecuteStreamCommand/index.html


----------------------------------------
Date: Mon, 22 Jun 2015 14:21:00 -0500
From: [email protected]
To: [email protected]
Subject: Re: Extracting text using RegEx

How can one run a script within NiFi to accomplish the parsing?

On 6/22/15 12:41 PM, Mark Payne wrote:
Srujan,

My guess is that the issue you are seeing is due to GetHTTP caching the ETag/LastModified value. When the
processor receives the response for an HTTP GET request, it writes the ETag to conf/.httpCache-<processor id>.

It does this so that even after a restart of NiFi, we don't keep pulling the same content. If the content changes at any
point, it will pull the new version of the content, though.

You could trigger it to pull data either by copying and pasting the GetHTTP Processor and letting the new processor
pull the data, or you could delete that file from the conf/ directory and restart.
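
For anyone curious why deleting the cache file forces a fresh pull, here is a minimal sketch of the general ETag idea in Python. It is not NiFi's code: the function names and the one-line cache file format are made up for the example, and it only assumes standard HTTP behavior (an `If-None-Match` request header and a 304 Not Modified response).

```python
from pathlib import Path


def load_etag(cache_file):
    """Return the cached ETag, or None if no cache file exists yet."""
    p = Path(cache_file)
    return p.read_text().strip() if p.exists() else None


def request_headers(etag):
    # With a cached ETag, ask the server to reply only if the content changed.
    return {"If-None-Match": etag} if etag else {}


def handle_response(status, response_etag, cache_file):
    """Return True if the response carries new content worth pulling in."""
    if status == 304:
        # Not Modified: the server says our cached version is still current.
        return False
    if response_etag:
        # Remember the new ETag so it survives a restart.
        Path(cache_file).write_text(response_etag)
    return True
```

With no cache file, `request_headers(load_etag(...))` is empty, so the next request is unconditional and the content comes down again.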

If this doesn't give you what you need, please feel free to let me know!

Thanks
-Mark

----------------------------------------
From: [email protected]
To: [email protected]
Subject: RE: Extracting text using RegEx
Date: Mon, 22 Jun 2015 15:11:18 +0000

Mark,

How can I rerun the processors after changing some of the attributes? For 
example, when I change the Regex pattern and start the processors, nothing 
happens.

Srujan Kotikela
FireHost - SECURE CLOUD HOSTING
North America | Europe | Asia Pacific

ComputerWorld: 100 Best Places to Work in IT See Current Opportunities

This email and any files transmitted with it are confidential and intended 
solely
for the use of the individual(s) to whom they are addressed. Do not disseminate,
distribute or copy this e-mail without explicit permission to do so. Thank you.



-----Original Message-----
From: Mark Payne [mailto:[email protected]]
Sent: Thursday, June 18, 2015 1:22 PM
To: [email protected]
Subject: RE: Extracting text using RegEx

Srujan,

When you pull the file via GetHTTP, it assigns a filename to the file. You can easily 
change the filename by using an UpdateAttribute Processor. Just add a new property with 
the name "filename" and whatever value you would like. Then, you can write both 
to the same directory.

With ExtractText, it will route the FlowFile to 'matched' or 'unmatched' depending on whether or
not any regex that you provided matches. However, if the regex has a capturing group, the text that
is extracted will be just what is captured by that group. For example, if your regex is
".*good-(bye).*" then it will route any FlowFile containing "good-bye"
to 'matched' but will extract only the text "bye" because that is what is in
the capturing group.
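
The capturing-group behavior can be tried outside NiFi. The snippet below is only an illustration using Python's `re` module (NiFi itself uses Java regular expressions, which behave the same way for this pattern); the `extract` function and its return shape are hypothetical.

```python
import re

# Same pattern as in the example above; only "bye" is inside the group.
PATTERN = re.compile(r".*good-(bye).*")


def extract(content):
    """Return ('matched', captured_text) or ('unmatched', None)."""
    m = PATTERN.search(content)
    if m is None:
        return ("unmatched", None)
    # group(1) is the text captured by the first set of parentheses.
    return ("matched", m.group(1))
```

Calling `extract("we said good-bye today")` gives `("matched", "bye")`, while `extract("hello")` gives `("unmatched", None)`.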

Once you have extracted the text, though, it is added to a FlowFile attribute, 
not the content. So you will want to use a ReplaceText to replace the content 
of the FlowFile before you use PutFile.

Does this make sense? If not, please let me know where I can help clarify, and 
I'll be happy to do so!

Thanks
-Mark

----------------------------------------
From: [email protected]
To: [email protected]
Subject: RE: Extracting text using RegEx
Date: Thu, 18 Jun 2015 18:08:58 +0000

Hi Mark,

I am trying to extract some text from a remote file/feed, downloaded via HTTP. 
The flow I am contemplating is like this:

GetHTTP ====> ExtractText == (matched) ==> PutFile
                  ||
              (unmatched)
                  ||
                  V
               PutFile

I am able to create this flow just fine. However, I have the following issues:

1. I noticed that the 'file' configured for the GetHTTP processor goes into the
'directory' configured in the 'PutFile' processor. This is leading me to save
the matched file and unmatched file in separate directories. Is there a way to
have those 2 files in the same directory?

2. I don't seem to get the RegEx working. The ExtractText processor either 
matches all input or no input. Are there any particular guidelines on how to 
write regex for NiFi?

Thanks,
Srujan Kotikela



-----Original Message-----
From: Mark Payne [mailto:[email protected]]
Sent: Tuesday, June 16, 2015 7:11 PM
To: [email protected]
Subject: RE: Extracting text using RegEx

Srujan,

I'm not sure how familiar you are with NiFi, so just a very quick note about
terminology to make sure you understand what I'm describing. A FlowFile is the
basic data record in NiFi. It consists of two parts:
- FlowFile Attributes (Key/Value Pairs that are strings)
- FlowFile Content (arbitrary stream of bytes)

I think the flow that you would want would look like this:

GetHTTP -> ExtractText -> ReplaceText -> PutFile

ExtractText will then evaluate the regex against the content pulled from the HTTP service and put the result in a FlowFile Attribute. 
So let's say you add a property named "desired.text" with a value "<body>(.*)</body>". This will 
create an Attribute named "desired.text" and the value of that attribute will be whatever is found between the <body> 
and </body> tags.

We will then use ReplaceText with the following configuration:
Regular Expression: .+
Replacement Value: ${desired.text}
All other properties: defaults.

So what this is doing is replacing the content of the FlowFile with the 
"desired.text" attribute.

PutFile then writes the file to disk.
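
To make the attribute-versus-content distinction concrete, here is a toy model of the ExtractText and ReplaceText steps in Python. This is not NiFi code: the `FlowFile` class and both functions are hypothetical stand-ins, and `re.DOTALL` is an assumption added so that a body spanning several lines still matches.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy stand-in for a NiFi FlowFile: byte content plus string attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


def extract_text(ff, name, regex):
    # Store the first capturing group in an attribute; the content is untouched.
    m = re.search(regex, ff.content.decode("utf-8"), re.DOTALL)
    if m:
        ff.attributes[name] = m.group(1)
    return ff


def replace_text(ff, name):
    # Overwrite the entire content with the value of the named attribute.
    if name in ff.attributes:
        ff.content = ff.attributes[name].encode("utf-8")
    return ff
```

After `extract_text(ff, "desired.text", r"<body>(.*)</body>")` the content is still the full HTML page and only the attribute holds the extracted text; it takes the `replace_text(ff, "desired.text")` step to make the content equal to that attribute, which is what a PutFile step would then write out.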

Hope this helps! If this doesn't work out for you for some reason, or if you've 
got more questions (or if I misunderstood what you're wanting to do), please 
don't hesitate to shoot back and let me know!

Thanks
-Mark

________________________________
From: [email protected]
To: [email protected]
CC: [email protected]
Subject: Extracting text using RegEx
Date: Tue, 16 Jun 2015 17:56:38 +0000


Hi,



I am trying to download a file (using GetHTTP) from a website and
extract text from it matching a RegEx pattern (using ExtractText).



I am able to download the file using GetHTTP and save it via PutFile.
I understand that ExtractText processor works only with a FlowFile.
So I tried generating a flow file from GetHTTP and PutFile
(separately), but it doesn't seem to work.



Can anyone give me pointers (examples?) on which processors to use
to extract text from a file pulled down by GetHTTP and write the
matched text to a separate file?



Thanks,

Srujan Kotikela









--
Dr. Chase C Cunningham
CTRC (SW) USN Ret.
The Cynja LLC Proprietary Business and Technical Information
CONFIDENTIAL TREATMENT REQUIRED
