Igor,

There is no automatic failover of the node that is considered primary.
This has been addressed for the upcoming 1.x release, though:
https://issues.apache.org/jira/browse/NIFI-483
Thanks
Joe

On Sun, May 1, 2016 at 2:36 PM, Igor Kravzov <igork.ine...@gmail.com> wrote:
> Thanks Aldrin for the response.
> What I didn't fully understand from the documentation: is automatic
> fail-over implemented? I would rather configure the entire workflow to
> run "On primary node".
>
>
> On Sun, May 1, 2016 at 1:31 PM, Aldrin Piri <aldrinp...@gmail.com> wrote:
>>
>> Igor,
>>
>> Your thoughts are correct, and without any additional configuration, the
>> GetTwitter processor would run on both nodes. The way to avoid this is
>> to select the "On primary node" scheduling strategy, which would have
>> the processor run only on whichever node is currently primary.
>>
>> PutHDFS has similar semantics, but there they would likely be desired.
>> Consider the case where data is partitioned across the nodes: PutHDFS
>> would then need to run on each node to ensure all of the data is
>> delivered to HDFS. The property you list determines where the data
>> should land on the configured HDFS instance. Often this is done via
>> Expression Language (EL) to get the familiar time slicing of resources
>> when persisted, such as ${now():format('yyyy/MM/dd/HH')}. You could
>> additionally have a directory structure that mirrors the data, making
>> use of attributes the files may have gained as they made their way
>> through your flow, or use an UpdateAttribute to set a property, such as
>> "hadoop.dest.dir", that is referenced by the final PutHDFS property to
>> give a dynamic location on a per-FlowFile basis.
>>
>> Let us know if you have additional questions or if things are unclear.
>>
>> --aldrin
>>
>>
>> On Sun, May 1, 2016 at 1:20 PM, Igor Kravzov <igork.ine...@gmail.com>
>> wrote:
>>>
>>> If I understand correctly, in cluster mode the same dataflow runs on
>>> all the nodes.
>>> So let's say I have a simple dataflow with GetTwitter and PutHDFS
>>> processors, and one NCM + 2 nodes.
>>> Does that actually mean GetTwitter will be called independently, and
>>> potentially simultaneously, on each node, so there may be duplicate
>>> results?
>>> How about the PutHDFS processor? Where should "hadoop configuration
>>> resources" and "parent HDFS directory" point on each node?
>>
>>
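
[Editor's illustrative sketch of the UpdateAttribute + PutHDFS pattern
Aldrin describes above. The attribute name "hadoop.dest.dir" comes from his
note; the base path and the Hadoop config file locations are hypothetical
placeholders for your environment.]

    UpdateAttribute (add a dynamic property; its name becomes the
    attribute name, its value is evaluated with Expression Language):

      hadoop.dest.dir = /data/twitter/${now():format('yyyy/MM/dd/HH')}

    PutHDFS (runs on every node; each node writes its own partition of
    the data):

      Hadoop Configuration Resources =
        /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
      Directory = ${hadoop.dest.dir}

Because the PutHDFS Directory property supports Expression Language, the
destination is resolved per FlowFile, so files flowing through at different
hours land in different hourly directories without any flow changes.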