Concern regarding NiFi processor's state captured during upgrade of NiFi cluster

2021-02-24 Thread sanjeet rath
Hi,

My use case is to upgrade the NiFi cluster from 1.8 to 1.12.1 with state (we
are using an external ZooKeeper, and it is a 3-node cluster).

So the approach I followed:
-> created 3 new Linux boxes and installed NiFi 1.12.1 & ZooKeeper 3.5.8
-> brought the flow.xml.gz, users.xml and authorizations.xml from the old 1.8
env to the newly created 1.12 cluster (both clusters are on different Linux boxes)
-> then used the zk-migrator.sh utility to create a zk-source-data.json in the
dev 1.8 env and applied it in the 1.12 cluster to apply the states.

Everything is fine with the above approach, and the state is captured in the
processors.


I am seeing another weird behaviour, which is that *without*
using zk-migrator.sh the state is captured properly just by bringing over the
flow.xml.gz.

Steps:
-> created 3 new Linux boxes and installed NiFi 1.12.1 & ZooKeeper 3.5.8
-> brought the flow.xml.gz, users.xml and authorizations.xml from the old 1.8
env to the newly created 1.12 cluster; I have *not* used the zk-migrator.sh
utility in the newly created 1.12.1 env
-> when I look at processors like ListS3 and ListSFTP, the state is
captured properly in the newly created 1.12.1 env; also, when I run
the processor it pulls only the data after the captured timestamp,
which is perfect.

Could someone help me understand how the state is captured in
the ListS3 processor in the 1.12.1 env? I can see it is not present in
flow.xml.gz.
Does it mean we can migrate the state by just bringing the flow.xml file,
without using the zk-migrator.sh utility?
One more question: where does the zk-migrator.sh utility write the states
in the destination cluster in the external ZooKeeper configuration?
Is it inside /nifi/state/locale/* ?

Thanks in advance,
-- 
Sanjeet Kumar Rath,


Re: Concern regarding NiFi processor's state captured during upgrade of NiFi cluster

2021-02-24 Thread Mark Payne
Sanjeet,

For this use case, you should not be using the zk-migrator.sh tool. That tool 
is not intended to be used when upgrading nifi. Rather, the tool is to be used 
if you’re migrating nifi away from one zookeeper and onto another. For example, 
if you have a ZooKeeper instance that is shared by many other services, and you 
decide that you want to run a separate ZooKeeper instance purely for NiFi, then 
you could use that tool to copy the nifi state from one zookeeper instance to 
another.

NiFi has two types of state: Local and Cluster state. ListS3, for example, 
would use Cluster state because if the Primary Node changes, the new node in 
the cluster needs to have that same state. So the state must be shared across 
the cluster. So the state itself is stored in ZooKeeper.

However, you could also have something like ListFile running on every node in 
the cluster, listing files on the local disk. In such a case, if Node 1 
performs some listing, it does not make sense to share that state with Node 2, 
because Node 2 has a completely different listing (a completely different 
disk). So the state is then stored in the ./state/local directory.
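
If it helps to see the two scopes in code, here is a rough Groovy sketch of how a
component would read and write them (e.g. from ExecuteScript) -- the state key name
is made up, and it assumes the standard ExecuteScript bindings (session, context,
log, REL_SUCCESS):

import org.apache.nifi.components.state.Scope

def flowFile = session.get()
if (!flowFile) return

def stateManager = context.stateManager   // ProcessContext#getStateManager()

// Cluster-scoped state lives in ZooKeeper and survives a Primary Node change (what ListS3 uses)
def clusterState = stateManager.getState(Scope.CLUSTER).toMap()

// Local-scoped state is per node, stored under ./state/local (what ListFile uses)
def localState = stateManager.getState(Scope.LOCAL).toMap()

// Writing a value back (illustrative key name only)
def updated = new HashMap(clusterState)
updated['last.run.timestamp'] = String.valueOf(System.currentTimeMillis())
stateManager.setState(updated, Scope.CLUSTER)

session.transfer(flowFile, REL_SUCCESS)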

So when you upgrade from 1.8 to 1.12.1 you should also copy over the state 
directory to avoid losing any local state. But as long as the 1.12.1 cluster is 
pointing at the same zookeeper, there is no need to migrate the zookeeper state.

Hope this helps!
-Mark



Groovy script

2021-02-24 Thread Tomislav Novosel
Hi guys,

I want to check if a file exists with this Groovy script:

flowfile = session.get()
if(!flowfile) return
file_path = flowfile.getAttribute('file_path')
File file = new File(file_path)
if(file.exists()){
session.transfer(flowfile, REL_FAILURE)
}
else{
session.transfer(flowfile, REL_SUCCESS)
}

and to route all files which exist to the FAILURE relationship, but all of them go
to SUCCESS, even though the file is definitely in the folder given by
'file_path', I checked.

What am I doing wrong?

Thanks,

Tom


Re: Concern regarding NiFi processor's state captured during upgrade of NiFi cluster

2021-02-24 Thread sanjeet rath
Thanks Mark for the detailed explanation.
Along with NiFi I am also upgrading ZooKeeper from version 3.4 to version 3.5.
So my NiFi 1.8 runs on ZooKeeper 3.4, the new NiFi 1.12.1 runs on ZooKeeper
3.5, and the two ZooKeepers are running on different Linux servers.

That is why I need to transfer the state from the ZooKeeper 3.4 instance to
the 3.5 instance.

But my original question is: when I just copy the flow.xml.gz file from the
1.8 NiFi instance to the new 1.12 NiFi instance (and please note this new
1.12 NiFi instance runs on a different ZooKeeper instance), how is the state
present in the ListS3 processor in the 1.12 version of NiFi, and how does it
start processing where it left off in the 1.8 environment? I have not applied
zk-migrator.sh to this new 3.5 ZooKeeper instance, so my expectation is that
the state should not be present here.


Re: Groovy script

2021-02-24 Thread Mike Thomsen
If file_path is pointing to a folder as you said, it's going to check
for the folder's existence. The fact that it's failing to return true
there suggests that something is wrong with the path in the file_path
attribute.
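
A rough ExecuteScript-style sketch for narrowing that down (purely illustrative) --
it just logs exactly what the NiFi JVM sees for that path, assuming the standard
ExecuteScript bindings:

def flowfile = session.get()
if (!flowfile) return

def path = flowfile.getAttribute('file_path')
def f = new File(path)

// Log existence, whether it is a file or a directory, readability, and the resolved absolute path
log.info("file_path='${path}' exists=${f.exists()} isFile=${f.isFile()} " +
        "isDirectory=${f.isDirectory()} canRead=${f.canRead()} absolutePath=${f.absolutePath}")

session.transfer(flowfile, REL_SUCCESS)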



RE: Groovy script

2021-02-24 Thread Tomislav Novosel
Hi Mike,

the attribute 'file_path' is not pointing to a folder only; it has a full
/path/to/filename value, so it is something like /opt/data/folder/filename.txt.
The attribute value is OK, I double-checked.

Tom


Re: Groovy script

2021-02-24 Thread Etienne Jouvin
Hello.

Not sure, but I already did something like this.
But unlike the Java style, I defined the variables with the keyword def:

flowfile = session.get()
if(!flowfile) return
def filePath = flowfile.getAttribute('file_path')
def file = new File(filePath)
if(file.exists()){
session.transfer(flowfile, REL_FAILURE)
} else {
session.transfer(flowfile, REL_SUCCESS)
}


Could be something like this.



Incremental Fetch in NIFI

2021-02-24 Thread KhajaAsmath Mohammed
Hi,

I have a use case where I need to do an incremental fetch on Oracle
tables. Is there an easy way to do this? I saw some posts about
QueryDatabaseTable. I want to check if there is an efficient way to do this.

Thanks,
Khaja


Re: Incremental Fetch in NIFI

2021-02-24 Thread Matt Burgess
Khaja,

There are two options in NiFi for incremental database fetch:
QueryDatabaseTable and GenerateTableFetch. The former is more often
used on a standalone NiFi cluster for single tables (as it does not
accept an incoming connection). It generates the SQL needed to do
incremental fetching, then executes the statements and writes out the
rows to the outgoing flowfile(s).  GenerateTableFetch is meant for a
cluster or for multiple tables (as it does accept an incoming
connection), and does the "first half" of what QueryDatabaseTable
does, generating SQL statements but not executing them. The statements
are written out as FlowFiles to be executed downstream.

In a cluster, these processors are meant to run on the primary node
only, otherwise each node will fetch the same information from the
database and handle it in its own copy of the flow. Set the processor
to run on the Primary Node Only. If using GenerateTableFetch, you can
distribute the generated SQL statements using a Remote Process Group
-> Input Port or a Load-Balanced Connection to an ExecuteSQL processor
downstream, which will parallelize the actual fetching among the
cluster nodes.
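
For what it's worth, here is a rough Groovy sketch of the general idea behind
incremental fetching -- not what QueryDatabaseTable actually does internally, just an
illustration of tracking a maximum-value column in cluster state and generating the
next statement (the table, column and state-key names are made up):

import org.apache.nifi.components.state.Scope
import org.apache.nifi.processor.io.OutputStreamCallback

// Read the last maximum value we recorded for the tracking column (null on the first run)
def stateMap = context.stateManager.getState(Scope.CLUSTER)
def lastMax = stateMap.get('orders.max.last_updated')

// Build the incremental SELECT; something downstream (e.g. ExecuteSQL) would run it,
// and a later step would store the new maximum back via the StateManager
def sql = 'SELECT * FROM ORDERS' +
        (lastMax ? " WHERE LAST_UPDATED > TIMESTAMP '${lastMax}'" : '')

def flowFile = session.create()
flowFile = session.write(flowFile, { out ->
    out.write(sql.getBytes('UTF-8'))
} as OutputStreamCallback)
session.transfer(flowFile, REL_SUCCESS)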

Regards,
Matt



some questions about splits

2021-02-24 Thread Greene (US), Geoffrey N
I'm having some trouble with multiple splits/merges.  Here's the idea:


Big data -> split 1->Save all the fragment.*attributes into variables -> split 
2-> save all the fragment.* attributes
|
Split 1
   |
Save fragment.* attributes into split1.fragment.*
|
Split 2
|
Save fragment.* attributes into split2.fragment.* attributes
|
(More processing)
|
Split 3
|
Save fragment.* attributes into split3.fragment.* attributes
|
(other stuff)
|
Restore split3.fragment.* attributes to fragment.*
|
Merge3, using defragment strategy
|
Restore split2.fragment.* attributes to fragment.*
|
Merge 2 using defragment strategy
|
Restore split1.fragment.* attributes to fragment.*
|
Merge 1 using defragment strategy

Am I thinking about this correctly?  It seems like sometimes NiFi is unable to
do a merge on some of the split data (errors like "there are 50 fragments, but
we only found one").  Is it possible that I need to do some prioritization in
the queues? I have noticed that things do back up and the queues seem to
fill up as it goes through (several of the splits need to perform REST calls
and processing, which can take time).  Maybe the issue is that one fragment
"slips" through before the others have even been processed far enough.  Is
there an approved way to do this?

Thanks for the help!




Re: some questions about splits

2021-02-24 Thread Mark Payne
Geoffrey,

At a high level, if you’re splitting multiple times and then trying to 
re-assemble everything, then yes I think your thought process is correct. But 
you’ve no doubt seen how complex and cumbersome this approach can be. It can 
also result in extremely poor performance. So much so that when I began 
creating a series of YouTube videos on NiFi Anti-Patterns, the first 
anti-pattern that I covered was the splitting and re-merging of data [1].

Generally, this should be an absolute last resort, and Record-oriented 
processors should be used instead of splitting the data up and re-merging it. 
If you need to perform REST calls, you could do that with LookupRecord, and 
either use the RESTLookupService or if that doesn’t fit the bill exactly you 
could actually use the ScriptedLookupService and write a small script in Groovy 
or Python that would perform the REST call for you and return the results. Or 
perhaps the ScriptedTransformRecord would be more appropriate - hard to tell 
without knowing the exact use case.
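
To give a flavour of the scripted route, here is a rough Groovy sketch for
ScriptedTransformRecord -- the endpoint URL and field names are made up, and it
assumes the processor's documented per-record `record` binding and the Record
setValue API:

// ScriptedTransformRecord (Groovy) is evaluated once per record.
// Hypothetical enrichment: fetch a value over REST and add it to the record.
def sensorId = record.getValue('sensor_id')                 // illustrative field name
if (sensorId != null) {
    // Simple blocking GET; a real flow would add timeouts and error handling
    def response = new URL("http://example.com/api/sensors/${sensorId}").text
    record.setValue('sensor_reading', response)             // assumes Record#setValue(String, Object)
}
return record    // returning the record keeps it; returning null drops it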

Obviously, your mileage may vary, but switching the data flow to use 
record-oriented processors, if possible, would typically yield a flow that is 
much simpler and yield throughput that is at least an order of magnitude better.

But if for whatever reason you do end up being stuck with the split/merge 
approach - the key would likely be to consider backpressure heavily. If you 
have backpressure set to 10,000 FlowFiles (the default) and then you’re trying 
to merge together data, but the data comes from many different upstream splits, 
you can certainly end up in a situation like this, where you don’t have all of 
the data from a given ’split’ queued up for MergeContent.

Hope this helps!
-Mark

[1] https://www.youtube.com/watch?v=RjWstt7nRVY



RE: some questions about splits

2021-02-24 Thread Greene (US), Geoffrey N
Thank you for the fast response Mark.

Hrm, record processing does sound useful.

Are there any good blogs / documentation on this?  I’d really like to learn 
more.  I’ve been doing mostly text processing, as you’ve observed.

My use case is something like this

1)  Use server API to get list of sensors

2)  Use server API to get list of jobs

3)  For each job, get count of frames/job.  There are up to 6 sets of 
frames/job, depending on which set you query for.

4)  Frames can only be queried 50 at a time, so for each 50, get set of 
time stamps of a frame

5)  For each time stamp, query for all sensor values at that time.  These 
have to be queried one at a time, because of the way the API works – one 
sensor value can only be given for one time.

6)  Glue all the data associated with that job (frame times, sensor 
readings, etc.) together and paste it into a big JSON. (There are more steps 
after that.)


Geoffrey Greene
Associate Technical Fellow/Senior Software Ninjaneer
(703) 414 2421
The Boeing Company



Re: some questions about splits

2021-02-24 Thread Matt Burgess
Geoffrey,

There's a really good blog by the man himself [1] :) I highly recommend the
official blog in general, lots of great posts and many are record-oriented
[2]

Regards,
Matt

[1] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
[2] https://blogs.apache.org/nifi/


Issue with QueryRecord failing when data is missing

2021-02-24 Thread Jens M. Kofoed
Hi all

I have an issue with using QueryRecord to query CSV files. Currently I'm
running NiFi version 1.12.1, but I also tested this in version 1.13.0.
If my incoming CSV file has only a header line and no data, it fails.

My query statement looks like this: SELECT colA FROM FLOWFILE WHERE colC
= 'true'

Changes made to the CSVReader:
Treat First Line as Header = true

Changes made to the CSVRecordSetWriter:
Include Header Line = false
Record Separator = ,

Here are 2 data samples. The first one works as expected, but sample 2 gives
an error.
Sample 1:
colA,colB,colC
data1A,data1B,true
data2A,data2B,false
data3A,data3B,true

Outcome: data1A,data3A,

Sample 2:
colA,colB,colC

Error message:
QueryRecord[id=d7c38f75-0177-1000--f694dd96] Unable to query
StandardFlowFileRecord[uuid=74a71c6e-3d3f-406c-92af-c9e4e27d6d69,claim=StandardContentClaim
[resourceClaim=StandardResourceClaim[id=1614232293848-3,
container=Node01Cont01, section=3], offset=463,
length=14],offset=0,name=74a71c6e-3d3f-406c-92af-c9e4e27d6d69,size=14] due
to java.sql.SQLException: Error while preparing statement [SELECT colA FROM
FLOWFILE WHERE colC = true]:
org.apache.nifi.processor.exception.ProcessException:
java.sql.SQLException: Error while preparing statement [SELECT colA FROM
FLOWFILE WHERE colC = true]

Is this a bug?

kind regards
Jens M. Kofoed


Why is totalSizeCap not a default parameter in the logback.xml file

2021-02-24 Thread Jens M. Kofoed
Hi

We have unfortunately had an incident where NiFi, over a weekend, filled
up the disk with logs because a process was failing and producing hundreds of
error messages per second.
We have changed the rollingPolicy to use daily rollover with a maxHistory
of 30 and a maxFileSize of 100MB.
When the daily logfile grows bigger than maxFileSize, the rolling policy
creates incrementing "subfiles" as .# for that day, but maxHistory does not
count the subfiles.
So with a process producing hundreds of error messages per second you can
end up in a situation where you have thousands of subfiles for each day,
filling up the disk.

There is an attribute called "totalSizeCap" which has been asked for in
JIRA:
https://issues.apache.org/jira/browse/NIFI-2203
https://issues.apache.org/jira/browse/NIFI-4315

This attribute already works, but by default it is not included in the
logback.xml file nor in the new stateless-logback.xml file.

Example:


<appender name="APP_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-stateless.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
        <fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-stateless_%d{yyyy-MM-dd_HH}.%i.log</fileNamePattern>
        <maxFileSize>100MB</maxFileSize>
        <maxHistory>30</maxHistory>
        <totalSizeCap>10GB</totalSizeCap>
        <cleanHistoryOnStart>true</cleanHistoryOnStart>
    </rollingPolicy>
    <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
        <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
    </encoder>
</appender>



Please add this as a new default parameter

kind regards
Jens M. Kofoed