Re: Question on setting up nifi flow

2016-04-28 Thread Susheel Kumar
Thanks Pierre, Simon and Bryan.  Let me take a look and come back with a few
more questions.

On Thu, Apr 28, 2016 at 11:32 AM, Simon Ball  wrote:

> GetMongo is an ingest-only processor, so it cannot accept an input flow
> file. It also has only a success relationship.
>
> A solution to this would be to use NiFi’s own deduplication.
>
> One flow would seed values in the distributed cache by using GetMongo to
> pull the ids and PutDistributedMapCache to store them in NiFi’s cache.
>
> The main ingest flow would then use UpdateAttribute to create a
> hash.value attribute matching the values inserted into the cache ->
> DetectDuplicate -> flow to PutMongo (use the upsert property) -success->
> PutSolrContentStream
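>
> A rough sketch of how those pieces line up, using the standard property
> names (the hash expression itself is an assumption about your data):
>
> UpdateAttribute:
>   hash.value: ${unique.id}    (assumed attribute holding the document id)
>
> PutDistributedMapCache / DetectDuplicate:
>   Cache Entry Identifier: ${hash.value}
>   Distributed Cache Service: DistributedMapCacheClientService
>
> DetectDuplicate then routes non-duplicate -> PutMongo and duplicate ->
> whatever update handling you need.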
>
> Simon
>
> On Apr 28, 2016, at 5:19 PM, Pierre Villard wrote:
>
> Hi Susheel,
>
> 1. HandleHttpRequest
> 2. RouteOnAttribute + HandleHttpResponse in case of errors detected in
> headers
> 3. Depending on what you want, there are a lot of options to handle JSON
> data (EvaluateJsonPath will probably be useful)
> 4. GetMongo (I think it will route to success in case there is an entry,
> and to failure if there is no record, but this has to be checked; otherwise
> an additional processor will do the job of checking the result of the request).
> 5. & 6. PutMongo + PutFile (if local folder) + PutSolr (if you want to do
> Solr by yourself).
>
> Depending on the details, this could be slightly different, but I think it
> gives a good idea of the minimal set of processors you would need.
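>
> For step 4, one possible way to pull the unique ID out for the Mongo lookup
> is EvaluateJsonPath; the JSON path below is an assumption about your schema:
>
> EvaluateJsonPath:
>   Destination: flowfile-attribute
>   unique.id: $.id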
>
> HTH,
> Pierre
>
>
> 2016-04-28 16:54 GMT+02:00 Susheel Kumar :
>
>> Hi,
>>
>> After attending the meetup in NYC, I am realizing NiFi can be used for the
>> data flow use case I have.  Can someone please share the steps/processors
>> necessary for the below use case.
>>
>>
>>1. Receive JSON on a HTTP REST end point
>>2. Parse Http Header and do validation. Return Error code & messages
>>as JSON to the response in case of validation failures
>>3. Parse request JSON, perform various validations (missing data in
>>fields), massage some data, add some data
>>4. Check if the request JSON unique ID is present in MongoDB and
>>compare timestamp to validate if this is an update request or a new 
>> request
>>5. If new request, an entry is made in mongo and then JSON files are
>>written to output folder for another process to pick up and submit to 
>> Solr.
>>6. If update request, mongo record is updated and JSON files are
>>written to output folder
>>
>>
>> I understand that something like the HandleHttpRequest processor can be used
>> for receiving the http request, and then PutSolrContentStream for writing to
>> Solr, but I'm not clear on which processors would be used for the validation
>> etc. in steps 2 thru 5 above.
>>
>> Appreciate your input.
>>
>> Thanks,
>> Susheel
>>
>>
>>
>>
>>
>
>


Re: Doing development on nifi

2016-04-28 Thread Joe Percivall
Hello Stéphane,

Just adding on to Matt's and Andy's answers: Andy mentioned Provenance[1] for 
replaying events, but I also find it very useful for debugging processors and 
flows. Data Provenance is a core feature of NiFi: it allows you to see exactly 
what a FlowFile looked like (attributes and content) before and after a 
processor acted on it, as well as a map of the journey that FlowFile took 
through your flow. The easiest way to see the provenance of a processor is to 
right-click on it and then click "Data provenance".

The documentation below should be a great introduction and if you have any 
questions feel free to ask!
 
[1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#data-provenance


Joe
- - - - - - 
Joseph Percivall
linkedin.com/in/Percivall
e: joeperciv...@yahoo.com



On Thursday, April 28, 2016 7:30 PM, Matt Burgess  wrote:



Stéphane,

Welcome to NiFi, glad to have you aboard!  May I ask what version you
are using? I believe as of at least 0.6.0, you can view the items in a
queued connection. So for your example, you can have a GetHttp into a
SplitJson, but don't start the SplitJson, just the GetHttp. You will
see any flowfiles generated by GetHttp queued up in the success (or
response?) connection (whichever you have wired to SplitJson). Then
you can right-click on the connection (the line between the
processors) and choose List Queue. In that dialog you can choose an
element by clicking on the Info icon ('i' in a circle) and see the
information about it, including a View button for the content.

The best part is that you don't have to do a "preview" run, then a
"real" run. The data is in the connection's queue, so you can make
alterations to your SplitJson, then start it to see if it works. If it
doesn't, stop it and start the GetHttp again (if stopped) to put more
data in the queue.  For fine-grained debugging, you can temporarily
set the Run schedule for the SplitJson to something like 10 seconds,
then when you start it, it will likely only bring in one flow file, so
you can react to how it works, then stop it before it empties the
queue.

I hope that makes sense, I apologize in advance if I made things more
confusing. The good news is there is a solution to your problem, even
if I am not the right person to describe it :)

Cheers,
Matt


On Thu, Apr 28, 2016 at 7:06 PM, Stéphane Maarek wrote:
> Hi,
>
> I'm very new to nifi and love the concept. As part of the process, I'm
> learning. My biggest frustration is that I can't see the data flowing
> through the system as I do development.
>
> Maybe I missed an article or a link, but is it possible to view the data
> while in the flow? I.e. Say I create a get http, I'd like it to fire once,
> get some data so I can see what it looks like. Then if I do a split json,
> I'd like to see if my output of it is what I expected or if I somehow messed
> up, etc etc
>
> I hope my question is clear
>
> Thanks in advance,
> Stéphane


Re: Doing development on nifi

2016-04-28 Thread Matt Burgess
Stéphane,

Welcome to NiFi, glad to have you aboard!  May I ask what version you
are using? I believe as of at least 0.6.0, you can view the items in a
queued connection. So for your example, you can have a GetHttp into a
SplitJson, but don't start the SplitJson, just the GetHttp. You will
see any flowfiles generated by GetHttp queued up in the success (or
response?) connection (whichever you have wired to SplitJson). Then
you can right-click on the connection (the line between the
processors) and choose List Queue. In that dialog you can choose an
element by clicking on the Info icon ('i' in a circle) and see the
information about it, including a View button for the content.

The best part is that you don't have to do a "preview" run, then a
"real" run. The data is in the connection's queue, so you can make
alterations to your SplitJson, then start it to see if it works. If it
doesn't, stop it and start the GetHttp again (if stopped) to put more
data in the queue.  For fine-grained debugging, you can temporarily
set the Run schedule for the SplitJson to something like 10 seconds,
then when you start it, it will likely only bring in one flow file, so
you can react to how it works, then stop it before it empties the
queue.

I hope that makes sense, I apologize in advance if I made things more
confusing. The good news is there is a solution to your problem, even
if I am not the right person to describe it :)

Cheers,
Matt

On Thu, Apr 28, 2016 at 7:06 PM, Stéphane Maarek wrote:
> Hi,
>
> I'm very new to nifi and love the concept. As part of the process, I'm
> learning. My biggest frustration is that I can't see the data flowing
> through the system as I do development.
>
> Maybe I missed an article or a link, but is it possible to view the data
> while in the flow? I.e. Say I create a get http, I'd like it to fire once,
> get some data so I can see what it looks like. Then if I do a split json,
> I'd like to see if my output of it is what I expected or if I somehow messed
> up, etc etc
>
> I hope my question is clear
>
> Thanks in advance,
> Stéphane


Doing development on nifi

2016-04-28 Thread Stéphane Maarek
Hi,

I'm very new to nifi and love the concept. As part of the process, I'm
learning. My biggest frustration is that I can't see the data flowing
through the system as I do development.

Maybe I missed an article or a link, but is it possible to view the data
while in the flow? I.e. Say I create a get http, I'd like it to fire once,
get some data so I can see what it looks like. Then if I do a split json,
I'd like to see if my output of it is what I expected or if I somehow
messed up, etc etc

I hope my question is clear

Thanks in advance,
Stéphane


Re: Is it possible to call a HIVE table from a ExecuteScript Processor?

2016-04-28 Thread Matt Burgess
The PR is out: https://github.com/apache/nifi/pull/384

I have a couple of minor changes to make but I think it should be
merged by the end of the day tomorrow.

On Thu, Apr 28, 2016 at 2:50 PM, Mike Harding  wrote:
> Hi Matt,
>
> Thanks for the info - do you have an idea when you plan to issue the PR for
> this?
>
>
> Cheers,
> Mike
>
> On Tue, 26 Apr 2016 at 14:47, Matt Burgess  wrote:
>>
>> Hive doesn't work with ExecuteSQL as its JDBC driver does not support
>> all the JDBC API calls made by ExecuteSQL / PutSQL.  However I am
>> working on a Hive NAR to include ExecuteHiveQL and PutHiveQL
>> processors (https://issues.apache.org/jira/browse/NIFI-981), there is
>> a prototype pull request on GitHub
>> (https://github.com/apache/nifi/pull/372) if you'd like to try them
>> out. I am currently adding support for Kerberos and finishing up, then
>> will issue a new PR for the processors.
>>
>> To use ExecuteScript in the meantime, you've got a couple of options
>> after downloading the driver and all its dependencies (or better yet,
>> the single "fat JAR"):
>>
>> 1) Add the location of the JAR(s) to the Module Directory property of
>> the ExecuteScript dialog. You will have to create your own connection;
>> if you're using Groovy, its Sql facility is quite nice
>>
>> (http://www.schibsted.pl/2015/06/groovy-sql-an-easy-way-to-database-scripting/)
>>
>> 2) Create a Database Connection Pool configured to point at the JAR(s)
>> and use the Hive driver (org.apache.hive.jdbc.HiveDriver). Then you
>> can get a connection from there and continue on with Groovy SQL (for
>> example). I have a blog post about this:
>> http://funnifi.blogspot.com/2016/04/sql-in-nifi-with-executescript.html
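>>
>> For example, a minimal Groovy sketch for option 1 (the JDBC URL,
>> credentials, and table name here are assumptions):
>>
>> import groovy.sql.Sql
>>
>> def sql = Sql.newInstance(
>>         'jdbc:hive2://localhost:10000/default',  // assumed HiveServer2 URL
>>         'hive', '',                              // assumed credentials
>>         'org.apache.hive.jdbc.HiveDriver')
>> try {
>>     // translate a code number to a readable name, per the use case above
>>     sql.eachRow('SELECT code, name FROM lookup_table') { row ->
>>         log.info("${row.code} -> ${row.name}")
>>     }
>> } finally {
>>     sql.close()
>> }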
>>
>> Regards,
>> Matt
>>
>> On Tue, Apr 26, 2016 at 8:07 AM, Pierre Villard wrote:
>> > Hi Mike,
>> >
>> > I never tried, but using the JDBC client you should be able to query your
>> > Hive table using the ExecuteSQL processor.
>> >
>> > Hope that helps,
>> > Pierre
>> >
>> >
>> > 2016-04-26 13:53 GMT+02:00 Mike Harding :
>> >>
>> >> Hi All,
>> >>
>> >> I have a requirement to access a lookup Hive table to translate a code
>> >> number in a FlowFile to a readable name. I'm just unsure how trivial
>> >> it is to connect to the db from an ExecuteScript processor?
>> >>
>> >> Nifi and the hiveserver2 sit on the same node so I'm wondering if it's
>> >> possible to use HiveServer2's JDBC client
>> >>
>> >> (https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC)
>> >> without any issues?
>> >>
>> >> Thanks in advance,
>> >> Mike
>> >
>> >


Re: Refer to original FlowFile after AttributesToJSON processor

2016-04-28 Thread Igor Kravzov
Thanks Andrew.  I tried to store the data as an attribute. But AttributesToJSON
converts all attributes, and if you exclude an attribute from the conversion, it
disappears from the file afterwards and you can't reference it. Unless I did
something wrong.

On Thu, Apr 28, 2016 at 11:28 AM, Andrew Grande wrote:

> Igor, this is a common pattern for such systems: you need to 'save' the old
> content in advance, e.g. as an attribute of the message; then you can
> transform the content and still have access to the original value (via an
> attribute value on the message).
>
> Keep in mind, though, that you don't want to put large values into
> attributes. Alternatively, you would have to design your solution so that you
> could store the content somewhere explicitly and look it up (e.g. have
> my_content_old & my_content_new as lookup keys). NiFi's DistributedMapCache
> facility might be a good fit for such data.
>
> Andrew
>
> From: Igor Kravzov 
> Reply-To: "users@nifi.apache.org" 
> Date: Monday, April 25, 2016 at 9:58 PM
> To: "users@nifi.apache.org" 
> Subject: Refer to original FlowFile after AttributesToJSON processor
>
> Is there a way to refer to the original FlowFile after an AttributesToJSON
> call? Destination is set to flowfile-content.
> Or is there a way to produce a FlowFile from an attribute?
>
> I need to extract some properties from the Twitter JSON, convert to a new
> JSON, and afterwards extract another property from the original JSON.
>
> I can't do it before, because this property would be included in the newly
> generated JSON.
>
> I am still battling with extracting the "entities" entry from Twitter JSON. I
> also extract "text" from the tweet, and found out that if I just blindly
> replace the escaped JSON after conversion, it breaks the JSON when "text"
> contains quotes.
>
> So my idea is to put some kind of placeholder parameter and replace it
> after conversion with the original "entities" value.
>
> Hope I explained the problem well enough.
>
> Thanks in advance.
>
>


Re: Refer to original FlowFile after AttributesToJSON processor

2016-04-28 Thread Andrew Grande
Igor, this is a common pattern for such systems: you need to 'save' the old 
content in advance, e.g. as an attribute of the message; then you can transform 
the content and still have access to the original value (via an attribute value 
on the message).

Keep in mind, though, that you don't want to put large values into attributes. 
Alternatively, you would have to design your solution so that you could store 
the content somewhere explicitly and look it up (e.g. have my_content_old & 
my_content_new as lookup keys). NiFi's DistributedMapCache facility might be a 
good fit for such data.
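
One way to 'save' the whole content as an attribute before AttributesToJSON
rewrites it is ExtractText; the attribute name and the size limit below are
assumptions, and large tweets will need a bigger capture limit:

ExtractText:
  original.json: (?s)(^.*$)
  Maximum Capture Group Length: 4096

The captured group lands in original.json.1, which you can reference later
with ${original.json.1}.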

Andrew

From: Igor Kravzov
Reply-To: "users@nifi.apache.org"
Date: Monday, April 25, 2016 at 9:58 PM
To: "users@nifi.apache.org"
Subject: Refer to original FlowFile after AttributesToJSON processor

Is there a way to refer to the original FlowFile after an AttributesToJSON 
call? Destination is set to flowfile-content.
Or is there a way to produce a FlowFile from an attribute?

I need to extract some properties from the Twitter JSON, convert to a new JSON, 
and afterwards extract another property from the original JSON.

I can't do it before, because this property would be included in the newly 
generated JSON.

I am still battling with extracting the "entities" entry from Twitter JSON. I 
also extract "text" from the tweet, and found out that if I just blindly replace 
the escaped JSON after conversion, it breaks the JSON when "text" contains quotes.

So my idea is to put some kind of placeholder parameter and replace it after 
conversion with the original "entities" value.

Hope I explained the problem well enough.

Thanks in advance.


Re: Question on setting up nifi flow

2016-04-28 Thread Bryan Bende
Hi Susheel,

In addition to what Pierre mentioned, if you are interested in an example
of using HandleHttpRequest/Response, there is a template in this repository:

https://github.com/hortonworks-gallery/nifi-templates

The template is HttpExecuteLsCommand.xml and shows how to build a web
service in NiFi that performs a directory listing.
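
As a quick orientation before opening the template: the request and response
processors are tied together through a shared HTTP context map controller
service, roughly like this (the port is an assumption):

HandleHttpRequest:
  Listening Port: 8081
  HTTP Context Map: StandardHttpContextMap

HandleHttpResponse:
  HTTP Status Code: 200 (or an attribute you set earlier, e.g. ${http.status.code})
  HTTP Context Map: (the same StandardHttpContextMap)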

-Bryan


On Thu, Apr 28, 2016 at 11:19 AM, Pierre Villard <
pierre.villard...@gmail.com> wrote:

> Hi Susheel,
>
> 1. HandleHttpRequest
> 2. RouteOnAttribute + HandleHttpResponse in case of errors detected in
> headers
> 3. Depending on what you want, there are a lot of options to handle JSON
> data (EvaluateJsonPath will probably be useful)
> 4. GetMongo (I think it will route to success in case there is an entry,
> and to failure if there is no record, but this has to be checked; otherwise
> an additional processor will do the job of checking the result of the request).
> 5. & 6. PutMongo + PutFile (if local folder) + PutSolr (if you want to do
> Solr by yourself).
>
> Depending on the details, this could be slightly different, but I think it
> gives a good idea of the minimal set of processors you would need.
>
> HTH,
> Pierre
>
>
> 2016-04-28 16:54 GMT+02:00 Susheel Kumar :
>
>> Hi,
>>
>> After attending the meetup in NYC, I am realizing NiFi can be used for the
>> data flow use case I have.  Can someone please share the steps/processors
>> necessary for the below use case.
>>
>>
>>1. Receive JSON on a HTTP REST end point
>>2. Parse Http Header and do validation. Return Error code & messages
>>as JSON to the response in case of validation failures
>>3. Parse request JSON, perform various validations (missing data in
>>fields), massage some data, add some data
>>4. Check if the request JSON unique ID is present in MongoDB and
>>compare timestamp to validate if this is an update request or a new 
>> request
>>5. If new request, an entry is made in mongo and then JSON files are
>>written to output folder for another process to pick up and submit to 
>> Solr.
>>6. If update request, mongo record is updated and JSON files are
>>written to output folder
>>
>>
>> I understand that something like the HandleHttpRequest processor can be used
>> for receiving the http request, and then PutSolrContentStream for writing to
>> Solr, but I'm not clear on which processors would be used for the validation
>> etc. in steps 2 thru 5 above.
>>
>> Appreciate your input.
>>
>> Thanks,
>> Susheel
>>
>>
>>
>>
>>
>


Re: NiFi - Configuration option

2016-04-28 Thread Andrew Grande
Manish,

Maybe not available as a variable necessarily (this will come later), but you 
can apply properties when you deploy your template. Basically, set them 
directly as part of the deployment. Take a look at 
https://github.com/aperepel/nifi-api-deploy
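
In the meantime, the JVM system property route from your email does work today,
since the expression language can resolve system properties and environment
variables. A minimal sketch (the property name and arg index are assumptions;
the index just needs to be unused in your bootstrap.conf):

# conf/bootstrap.conf
java.arg.15=-Dconfig.root=/data/myflow

# then, in any processor property that supports expression language:
${config.root}/output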

Andrew

From: Manish Gupta 8
Reply-To: "users@nifi.apache.org"
Date: Thursday, April 28, 2016 at 3:08 AM
To: "users@nifi.apache.org"
Subject: NiFi - Configuration option

Hi,

What is the best option for storing root/processor-group-level configurations? 
From “Expression Language Guide”, I know one can use an environment variable or 
a JVM system property, or specify the value for one flow using an UpdateAttribute 
processor near the top of the flow.

But is there a way I can have a single properties/XML file holding all my 
configurations, with each property available in NiFi as a variable?

Thanks,
Manish


Re: GetMongo question

2016-04-28 Thread Simon Ball
Not at present, unless you customize the query attribute yourself based on the 
results of the previous query, which is possible but a little fiddly.

A better solution would be to add an incremental processor for Mongo, similar 
to QueryDatabaseTable, which uses the NiFi State storage mechanism to maintain 
latest timestamps. Given the way Mongo ObjectIDs embed a timestamp, this could 
be done quite neatly.
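
For anyone who wants the fiddly workaround in the meantime, a rough Groovy
sketch of building such a query from the embedded timestamp (the ten-minute
window matches the schedule below; the extended-JSON quoting is an assumption):

import org.bson.types.ObjectId

// ObjectIds start with a 4-byte creation timestamp, so an id built from a
// Date acts as an "everything newer than this" lower bound:
def since = new ObjectId(new Date(System.currentTimeMillis() - 600000L))
def query = "{ \"_id\": { \"\$gt\": { \"\$oid\": \"${since}\" } } }"
// ... set this as GetMongo's Query property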

Would you be interested in contributing something like that?

The first step would be to put in an enhancement ticket at 
https://issues.apache.org/jira/browse/NIFI

Simon

On Apr 27, 2016, at 10:15 PM, nadine giampapa wrote:

I am using the GetMongo processor, but every time it runs it returns results 
from the beginning of time. I have my processor scheduled to run every ten 
minutes, and I would like it to return only the new results since the last time 
the processor ran. Is there a way to do that?



NiFi - Configuration option

2016-04-28 Thread Manish Gupta 8
Hi,

What is the best option for storing root/processor-group-level configurations? 
From "Expression Language Guide", I know one can use an environment variable or 
a JVM system property, or specify the value for one flow using an UpdateAttribute 
processor near the top of the flow.

But is there a way I can have a single properties/XML file holding all my 
configurations, with each property available in NiFi as a variable?

Thanks,
Manish