Start flow automatically on Nifi start

2015-10-15 Thread David Klim
Hello,
I have a flow combining several independent processes (different data sources, 
different data targets). I am looking for answers to the following questions; 
so far I have found nothing in the docs. Maybe some of the experts here can help:
- I would like to start some of the flows whenever the nifi server is 
restarted. I think I can use 'nifi.flowcontroller.autoResumeState' to achieve 
that. Or would that start all processors?
- I am creating a modular architecture, and depending on the modules that get 
deployed on a given cluster, I only need part of the flow. Is there a way to 
split the flow into several parts that can be independently added?
Thanks in advance! :-)

Re: Start flow automatically on Nifi start

2015-10-15 Thread Andrew Grande


- I would like to start some of the flows whenever the nifi server is 
restarted. I think I can use 'nifi.flowcontroller.autoResumeState' to achieve 
that. Or would that start all processors?

Yes, that's the purpose of this property and it's enabled by default. Did you 
try it and not see expected behavior?
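
For reference, the property lives in conf/nifi.properties and, with its 
default value, reads:

nifi.flowcontroller.autoResumeState=true

Set it to false if you would rather have everything come up stopped after a 
restart.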


- I am creating a modular architecture, and depending on the modules that get 
deployed on a given cluster, I only need part of the flow. Is there a way to 
split the flow into several parts that can be independently added?

I would go with Process Groups and configure input & output ports for each.

Andrew


Re: Nifi Clustering - work distribution on workers

2015-10-15 Thread M Singh
Hi Mark:
Thanks for your answers, but being a newbie I am still not clear on some 
issues:

Regarding multiple HDFS files:
"Typically, if you want to pull from HDFS and partition that data across the 
cluster, you would run ListHDFS on the Primary Node only, and then use 
Site-to-Site [1] to distribute that listing to all nodes in the cluster."
Question - I believe this requires distributing the list of files to the NCM, 
which will take care of distributing it to its worker nodes. Do we send the 
list of files to the NCM as a single message that the NCM splits, distributing 
one entry to each node, or should we send separate messages to the NCM so that 
it sends one message to each worker node? Also, if we send a single list of 
files to the NCM, does it send the same list to all its workers? If the NCM 
sends the same list, won't there be duplication of work?

Regarding concurrent tasks:
Question - How do they help in parallelizing the processing?

Regarding passing separate arguments to workers:
Question - This is related to the above two, i.e., how do we partition tasks 
across worker nodes in a cluster?
Thanks again for your help.
Mans


On Wednesday, October 14, 2015 2:08 PM, Mark Payne wrote:

Mans,
Nodes in a cluster work independently from one another and do not know about 
each other. That is accurate. Each node in a cluster runs the same flow. 
Typically, if you want to pull from HDFS and partition that data across the 
cluster, you would run ListHDFS on the Primary Node only, and then use 
Site-to-Site [1] to distribute that listing to all nodes in the cluster. Each 
node would then pull the data that it is responsible for and begin working on 
it. We do realize that it is not ideal to have to set things up this way, and 
we are working on making it much easier to have that listing automatically 
distributed across the cluster.

I'm not sure that I understand your #3 - how do we design the workflow so that 
the nodes work on one file at a time? For each Processor, you can configure how 
many threads (Concurrent Tasks) are to be used in the Scheduling tab of the 
Processor Configuration dialog. You can certainly configure that to run only a 
single Concurrent Task. This is the number of Concurrent Tasks that will run on 
each node in the cluster, not the total number of concurrent tasks that would 
run across the entire cluster.

I am not sure that I understand your #4 either. Are you indicating that you 
want to configure each node in the cluster with a different value for a 
processor property?

Does this help?

Thanks
-Mark
[1] http://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site


On Oct 14, 2015, at 4:49 PM, M Singh wrote:

Hi:

A few questions about NiFi cluster:
1. If we have multiple worker nodes in the cluster, do they partition the work 
if the source allows partitioning (e.g., HDFS), or do all the nodes work on the 
same data?
2. If the nodes partition the work, then how do they coordinate the work 
distribution, recovery, etc.? From the documentation it appears that the 
workers are not aware of each other.
3. If I need to process multiple files, how do we design the workflow so that 
the nodes work on one file at a time?
4. If I have multiple arguments and need to pass one parameter to each worker, 
how can I do that?
5. Is there any way to control how many workers are involved in processing the 
flow?
6. Does specifying the number of threads in the processor distribute work 
across multiple workers? Does it split the task across the threads, or is that 
the responsibility of the application?

I tried to find some answers in the documentation and the users list but could 
not get a clear picture.
Thanks
Mans

Re: Start flow automatically on Nifi start

2015-10-15 Thread Brandon DeVries
David,

To expand on the second part of Andrew's answer, you can make templates of
the Process Groups you divide your flow into so that you can add only the
ones you need to any given cluster.

Brandon

On Thu, Oct 15, 2015 at 12:16 PM, Andrew Grande wrote:

>
>
> - I would like to start some of the flows whenever the nifi server is
> restarted. I think I can use 'nifi.flowcontroller.autoResumeState' to
> achieve that. Or would that start all processors?
>
>
> Yes, that's the purpose of this property and it's enabled by default. Did
> you try it and not see expected behavior?
>
>
> - I am creating a modular architecture, and depending on the modules that
> get deployed on a given cluster, I only need part of the flow. Is there a
> way to split the flow into several parts that can be independently added?
>
>
> I would go with Process Groups and configure input & output ports for
> each.
>
> Andrew
>


RE: Start flow automatically on Nifi start

2015-10-15 Thread David Klim
Thanks for the answers!
So I guess I will need to create the flow programmatically using the NiFi REST 
API, depending on the specifics of the cluster being set up, essentially:
Create process group -> instantiate template
Will give that a try. Long term, I think NiFi could accept several flow 
definitions in the config.
Cheers
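
Roughly what I have in mind, in Python. The endpoint paths and payload fields 
below are my assumptions from skimming the 0.x-era REST API (check the docs 
for your version), and the template id is a placeholder:

import requests

NIFI = "http://localhost:8080/nifi-api"     # assumption: default host/port

# Grab a revision (version + client id) first; mutable requests need it.
rev = requests.get(NIFI + "/controller/revision").json()["revision"]

# Instantiate a previously uploaded template onto the root canvas.
requests.post(
    NIFI + "/controller/process-groups/root/template-instance",
    data={
        "templateId": "REPLACE-WITH-TEMPLATE-ID",   # placeholder
        "originX": 100.0,
        "originY": 100.0,
        "version": rev["version"],
        "clientId": rev["clientId"],
    })
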
From: b...@jhu.edu
Date: Thu, 15 Oct 2015 12:30:19 -0400
Subject: Re: Start flow automatically on Nifi start
To: users@nifi.apache.org

David,

To expand on the second part of Andrew's answer, you can make templates of the 
Process Groups you divide your flow into so that you can add only the ones you 
need to any given cluster.

Brandon

On Thu, Oct 15, 2015 at 12:16 PM, Andrew Grande wrote:

- I would like to start some of the flows whenever the nifi server is 
restarted. I think I can use 'nifi.flowcontroller.autoResumeState' to achieve 
that. Or would that start all processors?

Yes, that's the purpose of this property and it's enabled by default. Did you 
try it and not see expected behavior?

- I am creating a modular architecture, and depending on the modules that get 
deployed on a given cluster, I only need part of the flow. Is there a way to 
split the flow into several parts that can be independently added?

I would go with Process Groups and configure input & output ports for each.

Andrew

Re: Start flow automatically on Nifi start

2015-10-15 Thread Oleg Zhurakousky
David

I think what is worth clarifying further is that the NiFi flow lifecycle is 
based on the lifecycle of the individual components (e.g., Processors) within 
the flow. In other words, one can start/stop any Processor individually.
So the ‘autoResumeState’ property will not start your flow automatically; it 
will simply preserve the state of all components between restarts. This means 
that if any one of those components was stopped, it will resume as stopped.
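
If you do need a stopped component to be started right after boot, one option 
is a small script against the REST API once NiFi is up. A rough sketch in 
Python; the endpoint path and payload fields are assumptions for the 0.x-era 
API, and the ids are placeholders:

import requests

NIFI = "http://localhost:8080/nifi-api"      # assumption: default host/port
GROUP_ID = "root"                            # parent group of the processor
PROC_ID = "REPLACE-WITH-PROCESSOR-ID"        # placeholder

# Mutable requests must carry the current revision (version + client id).
rev = requests.get(NIFI + "/controller/revision").json()["revision"]

requests.put(
    NIFI + "/controller/process-groups/" + GROUP_ID + "/processors/" + PROC_ID,
    data={
        "state": "RUNNING",                  # start the processor
        "version": rev["version"],
        "clientId": rev["clientId"],
    })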

Cheers
Oleg

On Oct 15, 2015, at 1:03 PM, David Klim <davidkl...@hotmail.com> wrote:

Thanks for the answers!

So I guess I will need to create the flow programmatically using the NiFi REST 
API, depending on the specifics of the cluster being set up, essentially:

Create process group -> instantiate template

Will give that a try. Long term, I think NiFi could accept several flow 
definitions in the config.

Cheers



From: b...@jhu.edu
Date: Thu, 15 Oct 2015 12:30:19 -0400
Subject: Re: Start flow automatically on Nifi start
To: users@nifi.apache.org

David,

To expand on the second part of Andrew's answer, you can make templates of the 
Process Groups you divide your flow into so that you can add only the ones you 
need to any given cluster.

Brandon

On Thu, Oct 15, 2015 at 12:16 PM, Andrew Grande <agra...@hortonworks.com> wrote:


- I would like to start some of the flows whenever the nifi server is 
restarted. I think I can use 'nifi.flowcontroller.autoResumeState' to achieve 
that. Or would that start all processors?

Yes, that's the purpose of this property and it's enabled by default. Did you 
try it and not see expected behavior?


- I am creating a modular architecture, and depending on the modules that get 
deployed on a given cluster, I only need part of the flow. Is there a way to 
split the flow into several parts that can be independently added?

I would go with Process Groups and configure input & output ports for each.

Andrew



Re: StoreInKiteDataset help

2015-10-15 Thread Christopher Wilson
Has anyone gotten Kite to work on HDP?  I'd wanted to do this very thing
but am running into all kinds of issues with having .jar files not in the
distributed cache (basically in /apps/hdp).

Any feedback appreciated.

-Chris

On Sat, Sep 19, 2015 at 11:04 AM, Tyler Hawkes wrote:

> Thanks for the link. I'm using
> "dataset:hive://hadoop01:9083/default/sandwiches". hadoop01 has hive on it.
>
> On Fri, Sep 18, 2015 at 7:36 AM, Jeff wrote:
>
>> Not sure if this is what you are looking for but it has a bit on kite.
>>
>> http://ingest.tips/2014/12/22/getting-started-with-apache-nifi/
>>
>> -cb
>>
>>
>> On Sep 18, 2015, at 8:32 AM, Bryan Bende wrote:
>>
>> Hi Tyler,
>>
>> Unfortunately I don't think there are any tutorials on this. Can you
>> provide an example of the dataset uri you specified that is showing as
>> invalid?
>>
>> Thanks,
>>
>> Bryan
>>
>> On Fri, Sep 18, 2015 at 12:36 AM, Tyler Hawkes wrote:
>>
>>> I'm just getting going on NiFi and trying to write data to Hive either
>>> from Kafka or an RDBMS. After setting up the hadoop configuration files and
>>> a target dataset uri it says the uri is invalid. I'm wondering if there's a
>>> tutorial on getting kite set up with my version of hive (HDP 2.2 running
>>> hive 0.14) and nifi since I've been unable to find anything on google or on
>>> the mailing list archive and the documentation of StoreInKiteDataset is
>>> lacking a lot of detail.
>>>
>>> Any help on this would be greatly appreciated.
>>>
>>
>>
>>


Re: StoreInKiteDataset help

2015-10-15 Thread Joe Witt
Chris,

Are you seeing errors in NiFi or in HDP?  If you're seeing errors in
NiFi can you please send us the logs?

Thanks
Joe

On Thu, Oct 15, 2015 at 3:02 PM, Christopher Wilson wrote:
> Has anyone gotten Kite to work on HDP?  I'd wanted to do this very thing but
> am running into all kinds of issues with having .jar files not in the
> distributed cache (basically in /apps/hdp).
>
> Any feedback appreciated.
>
> -Chris
>
> On Sat, Sep 19, 2015 at 11:04 AM, Tyler Hawkes wrote:
>>
>> Thanks for the link. I'm using
>> "dataset:hive://hadoop01:9083/default/sandwiches". hadoop01 has hive on it.
>>
>> On Fri, Sep 18, 2015 at 7:36 AM, Jeff wrote:
>>>
>>> Not sure if this is what you are looking for but it has a bit on kite.
>>>
>>> http://ingest.tips/2014/12/22/getting-started-with-apache-nifi/
>>>
>>> -cb
>>>
>>>
>>> On Sep 18, 2015, at 8:32 AM, Bryan Bende wrote:
>>>
>>> Hi Tyler,
>>>
>>> Unfortunately I don't think there are any tutorials on this. Can you
>>> provide an example of the dataset uri you specified that is showing as
>>> invalid?
>>>
>>> Thanks,
>>>
>>> Bryan
>>>
>>> On Fri, Sep 18, 2015 at 12:36 AM, Tyler Hawkes wrote:

 I'm just getting going on NiFi and trying to write data to Hive either
 from Kafka or an RDBMS. After setting up the hadoop configuration files and
 a target dataset uri it says the uri is invalid. I'm wondering if there's a
 tutorial on getting kite set up with my version of hive (HDP 2.2 running
 hive 0.14) and nifi since I've been unable to find anything on google or on
 the mailing list archive and the documentation of StoreInKiteDataset is
 lacking a lot of detail.

 Any help on this would be greatly appreciated.
>>>
>>>
>>>
>


Provenance doesn't work with FetchS3Object

2015-10-15 Thread Ben Meng
I understand that the FetchS3Object processor requires an incoming FlowFile to 
trigger it. The problem is that FetchS3Object emits a RECEIVE provenance event 
for the existing FlowFile. That event causes the following error when I try to 
open the lineage chart for a simple flow: GenerateFlowFile -> FetchS3Object.

"Found cycle in graph. This indicates that multiple events were registered 
claiming to have generated the same FlowFile (UUID = 
40f58407-ea10-4843-b8d1-be0e24f685aa)"

Should FetchS3Object create a new FlowFile for each fetched object? If so, does 
it really require an incoming FlowFile?

Regards,
Ben




Re: Provenance doesn't work with FetchS3Object

2015-10-15 Thread Oleg Zhurakousky
Ben

I don’t think it needs an incoming FlowFile. It is a scheduled component and 
will retrieve contents based on how you configure scheduling.
Have you tried it without incoming FlowFiles?

Cheers
Oleg

On Oct 15, 2015, at 3:38 PM, Ben Meng <ben.m...@lifelock.com> wrote:

I understand that the FetchS3Object processor requires an incoming FlowFile to 
trigger it. The problem is that FetchS3Object emits a RECEIVE provenance event 
for the existing FlowFile. That event causes the following error when I try to 
open the lineage chart for a simple flow: GenerateFlowFile -> FetchS3Object.

"Found cycle in graph. This indicates that multiple events were registered 
claiming to have generated the same FlowFile (UUID = 
40f58407-ea10-4843-b8d1-be0e24f685aa)"

Should FetchS3Object create a new FlowFile for each fetched object? If so, does 
it really require an incoming FlowFile?

Regards,
Ben





Re: Provenance doesn't work with FetchS3Object

2015-10-15 Thread Ben Meng
Oleg,

Yes, I’ve tried running FetchS3Object without any incoming FlowFile, and it 
just didn’t generate any output. I’ve also confirmed the behavior by inspecting 
the code: the first thing it does is check whether there’s an incoming 
FlowFile, and it returns if there isn’t.

Regards,
Ben


From: Oleg Zhurakousky
Reply-To: "users@nifi.apache.org"
Date: Thursday, October 15, 2015 at 12:49 PM
To: "users@nifi.apache.org"
Subject: Re: Provenance doesn't work with FetchS3Object

Ben

I don’t think it needs an incoming FlowFile. It is a scheduled component and 
will retrieve contents based on how you configure scheduling.
Have you tried it without incoming FlowFiles?

Cheers
Oleg

On Oct 15, 2015, at 3:38 PM, Ben Meng <ben.m...@lifelock.com> wrote:

I understand that the FetchS3Object processor requires an incoming FlowFile to 
trigger it. The problem is that FetchS3Object emits a RECEIVE provenance event 
for the existing FlowFile. That event causes the following error when I try to 
open the lineage chart for a simple flow: GenerateFlowFile -> FetchS3Object.

"Found cycle in graph. This indicates that multiple events were registered 
claiming to have generated the same FlowFile (UUID = 
40f58407-ea10-4843-b8d1-be0e24f685aa)"

Should FetchS3Object create a new FlowFile for each fetched object? If so, does 
it really require an incoming FlowFile?

Regards,
Ben




Re: Provenance doesn't work with FetchS3Object

2015-10-15 Thread Mark Payne
Ben,

Since FetchS3Object is not creating the FlowFile, it should not be emitting a 
RECEIVE event. This is certainly a bug.

I have created a ticket for this: 
https://issues.apache.org/jira/browse/NIFI-1038

Thanks
-Mark


> On Oct 15, 2015, at 3:57 PM, Ben Meng wrote:
> 
> Oleg,
> 
> Yes, I’ve tried running FetchS3Object without any incoming FlowFile, and it 
> just didn’t generate any output. I’ve also confirmed the behavior by 
> inspecting the code: the first thing it does is check whether there’s an 
> incoming FlowFile, and it returns if there isn’t.
> 
> Regards,
> Ben
> 
> 
> From: Oleg Zhurakousky
> Reply-To: "users@nifi.apache.org"
> Date: Thursday, October 15, 2015 at 12:49 PM
> To: "users@nifi.apache.org"
> Subject: Re: Provenance doesn't work with FetchS3Object
> 
> Ben
> 
> I don’t think it needs an incoming FlowFile. It is a scheduled component and 
> will retrieve contents based on how you configure scheduling.
> Have you tried it without incoming FlowFiles?
> 
> Cheers
> Oleg
> 
>> On Oct 15, 2015, at 3:38 PM, Ben Meng <ben.m...@lifelock.com> wrote:
>> 
>> I understand that the FetchS3Object processor requires an incoming FlowFile 
>> to trigger it. The problem is that FetchS3Object emits a RECEIVE provenance 
>> event for the existing FlowFile. That event causes the following error when 
>> I try to open the lineage chart for a simple flow: GenerateFlowFile -> 
>> FetchS3Object.
>> 
>> "Found cycle in graph. This indicates that multiple events were registered 
>> claiming to have generated the same FlowFile (UUID = 
>> 40f58407-ea10-4843-b8d1-be0e24f685aa)"
>> 
>> Should FetchS3Object create a new FlowFile for each fetched object? If so, 
>> does it really require an incoming FlowFile?
>> 
>> Regards,
>> Ben



Re: Provenance doesn't work with FetchS3Object

2015-10-15 Thread Ben Meng
Thanks Mark. That makes sense.

Regards,
Ben

From: Mark Payne
Reply-To: "users@nifi.apache.org"
Date: Thursday, October 15, 2015 at 1:09 PM
To: "users@nifi.apache.org"
Subject: Re: Provenance doesn't work with FetchS3Object

Ben,

Since FetchS3Object is not creating the FlowFile, it should not be emitting a 
RECEIVE event. This is certainly a bug.

I have created a ticket for this: 
https://issues.apache.org/jira/browse/NIFI-1038

Thanks
-Mark


On Oct 15, 2015, at 3:57 PM, Ben Meng <ben.m...@lifelock.com> wrote:

Oleg,

Yes, I’ve tried running FetchS3Object without any incoming FlowFile, and it 
just didn’t generate any output. I’ve also confirmed the behavior by inspecting 
the code: the first thing it does is check whether there’s an incoming 
FlowFile, and it returns if there isn’t.

Regards,
Ben


From: Oleg Zhurakousky
Reply-To: "users@nifi.apache.org"
Date: Thursday, October 15, 2015 at 12:49 PM
To: "users@nifi.apache.org"
Subject: Re: Provenance doesn't work with FetchS3Object

Ben

I don’t think it needs an incoming FlowFile. It is a scheduled component and 
will retrieve contents based on how you configure scheduling.
Have you tried it without incoming FlowFiles?

Cheers
Oleg

On Oct 15, 2015, at 3:38 PM, Ben Meng <ben.m...@lifelock.com> wrote:

I understand that the FetchS3Object processor requires an incoming FlowFile to 
trigger it. The problem is that FetchS3Object emits a RECEIVE provenance event 
for the existing FlowFile. That event causes the following error when I try to 
open the lineage chart for a simple flow: GenerateFlowFile -> FetchS3Object.

"Found cycle in graph. This indicates that multiple events were registered 
claiming to have generated the same FlowFile (UUID = 
40f58407-ea10-4843-b8d1-be0e24f685aa)"

Should FetchS3Object create a new FlowFile for each fetched object? If so, does 
it really require an incoming FlowFile?

Regards,
Ben




rest api synchronize

2015-10-15 Thread Devin Pinkston
I'm using the REST API to instantiate templates, start process groups, and
also remove those process groups. However, when I try to delete a process
group through the REST API, I keep getting the following 409 response:

"This NiFi instance has been updated by 'anonymous'. Please refresh to
synchronize the view."

Am I missing something when using the REST API?

Thank you.


Re: rest api synchronize

2015-10-15 Thread Matt Gilman
Devin,

NiFi employs an optimistic locking scheme that requires clients to pass in
a revision when making a mutable request. The revision is composed of a
version (a number that increments with each modification) and a client id.
The client id can be any string. Typically, however, you'll use the client
id that's generated for you when you make your first request (like
retrieving the entire flow). You should continue to use this client id with
all subsequent requests. In order for a mutable request to be accepted, a
client must either have the current version or be the client that last
modified the flow. The second check allows clients to submit requests
asynchronously without having to wait for preceding requests to complete.

I am guessing, based on your error message, that you're not including a
client id in your request and your revision is not current. Each successful
mutable request will contain the updated revision in the response.
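
As a minimal sketch of that pattern in Python (the endpoint paths are
assumptions for the 0.x-era API, and the group id is a placeholder):

import requests

NIFI = "http://localhost:8080/nifi-api"     # assumption: default host/port
GROUP_ID = "REPLACE-WITH-GROUP-ID"          # placeholder

# First request: obtain the current revision (version + client id).
rev = requests.get(NIFI + "/controller/revision").json()["revision"]

# Pass both with the mutable request; each successful mutable response
# carries the updated revision, so keep reusing the same client id.
resp = requests.delete(
    NIFI + "/controller/process-groups/" + GROUP_ID,
    params={"version": rev["version"], "clientId": rev["clientId"]})
print(resp.status_code)   # 409 means the revision was stale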

Let me know if this helps. If not, could you send the requests you're making?

Thanks!

Matt

On Thu, Oct 15, 2015 at 8:13 PM, Devin Pinkston wrote:

> I'm using the REST API to instantiate templates, start process groups, and
> also remove those process groups. However, when I try to delete a process
> group through the REST API, I keep getting the following 409 response:
>
> "This NiFi instance has been updated by 'anonymous'. Please refresh to
> synchronize the view."
>
> Am I missing something when using the REST API?
>
> Thank you.
>