RE: List/Fetch pattern for QueryDatabaseTable

Vos, Walter Fri, 17 Aug 2018 01:21:16 -0700

Hi Matt,

Thanks for that very thorough explanation. I'll be sure to present this to my 
dev's in order to come to a way of working. When it comes to 
ListDatabaseTables, how can I use this when I don't want to query all tables 
from the database? Let's say there's some 30 tables in the database and I only 
want to query 8 of them. I realize there's a parameter called 'Table Name 
Pattern' in the processor, but how do I put together a pattern when the tables 
I'm targeting have very different names? Also, it seems a bit silly to open a 
database connections just to list the specific tables I'm looking to query 
(which I already know the names of). Is there a processor I can use where I can 
just fill out this list myself?

Perhaps these are very basic questions, by the way. Everybody here is learning 
how to work with NiFi and from an admin perspective I'm trying to provide some 
guiding on best practices for common flows.

Kind regards,

Walter

-----Oorspronkelijk bericht-----
Van: Matt Burgess [mailto:mattyb...@apache.org]
Verzonden: donderdag 16 augustus 2018 15:55
Aan: users@nifi.apache.org
Onderwerp: Re: List/Fetch pattern for QueryDatabaseTable

Walter,

If you're looking to distribute database fetching among a cluster, then 
GenerateTableFetch is the right choice (over QueryDatabaseTable).
As of NiFi 1.2.0 (via NIFI-2881 [1]), GenerateTableFetch accepts incoming flow 
files, the capability was added in response to exactly the use case you 
outlined, distributed fetch of multiple tables via ListDatabaseTables. You 
still want your source processor to run on the Primary Node Only, otherwise all 
nodes get the same source data and as you said, you end up with duplicate data.

QueryDatabaseTable does not accept incoming connections, but you can use 
ExecuteSQL to actually do the fetching.  To distributed fetching of tables 
among the cluster, I recommend the following flow:

ListDatabaseTables (on Primary Node only) -> RPG -> Input Port -> 
GenerateTableFetch -> ExecuteSQL

Each node on the cluster will get flow files for the various tables in the 
database, then GenerateTableFetch will generate the SQL to fetch "pages" based 
on the Partition Size property, then ExecuteSQL will execute the statements. 
You can use multiple concurrent tasks for ExecuteSQL to perform the fetching 
concurrently for the SQL statements, and the RPG->Input port part will let you 
do the tables in parallel.  If instead you want to fully distribute the SQL 
execution (fetching various pages from various tables), you could move the
RPG->Input Port after GTF:

ListDatabaseTables (on Primary Node only) -> GenerateTableFetch -> RPG
-> Input Port -> ExecuteSQL

In this flow, the Primary Node will do all the work of generating all the SQL 
for all the pages, then will distributed the SQL among the cluster. So each 
node may be grabbing different pages from the same table, etc.  Depending on 
how much work it takes to generate the SQL, this may not be as performant as 
the first flow.  Alternatively you can distribute the SQL generation and the 
execution:

ListDatabaseTables (on Primary Node only) -> RPG -> Input Port -> 
GenerateTableFetch -> RPG -> Input Port -> ExecuteSQL

This might be overkill but does "fully" parallelize the work. In addition, as 
mentioned, you can set multiple concurrent tasks for ExecuteSQL (but not 
GenerateTableFetch) to achieve concurrency for fetching. One thing to watch out 
for in all cases is the Max Connections property for the DBCPConnectionPool. 
Each node will get its own pool, but depending on how much is going through GTF 
and ExecuteSQL, you may run out of connections (which will slow your
throughput) or if Max Connections is high, you may exhaust all connections at 
the server, just something to keep in mind when configuring the flow.

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-2881
On Thu, Aug 16, 2018 at 7:55 AM Vos, Walter <walter....@ns.nl> wrote:
>
> Hi,
>
>
>
> I’m trying to find a good strategy for distributing work among a cluster when 
> we’re fetching data from a database. My developers are currently doing 
> GenerateTableFetch and executing it only on the primary node because 
> “otherwise we end up with duplicate data”. A little googling on my end and I 
> found out about the List/Fetch pattern. All the examples are for SFTP though.
>
>
>
> I’m wondering what a good configuration might be if you’re looking to use 
> this pattern for fetching from a database. I’ve found GenerateTableFetch, and 
> I can certainly use this, but since we’re querying multiple tables (but not 
> all tables in the DB!) I’m hoping to use something like ListDatabaseTables 
> before that, so that GenerateTableFetch can be done on the whole cluster and 
> then QueryDatabaseTable as well.
>
>
>
> So one option is Multiple GenerateTableFetch processors > Funnel > RPG // 
> Input port > QueryDatabaseTable. I’m wondering if there’s also a good way to 
> go this route: ListDatabaseTables > RPG // Input port > GenerateTableFetch > 
> QueryDatabaseTable. I want to distribute as much work as possible within the 
> cluster.
>
>
>
> Kind regards,
>
>
>
> Walter
>
>
> ________________________________
>
> Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor 
> (gebruik door) de geadresseerde. De e-mail kan persoonlijke of vertrouwelijke 
> informatie bevatten. Openbaarmaking, vermenigvuldiging, verspreiding en/of 
> verstrekking van (de inhoud van) deze e-mail (en eventuele bijlagen) aan 
> derden is uitdrukkelijk niet toegestaan. Indien u niet de bedoelde 
> geadresseerde bent, wordt u vriendelijk verzocht degene die de e-mail verzond 
> hiervan direct op de hoogte te brengen en de e-mail (en eventuele bijlagen) 
> te vernietigen.
>
> Informatie vennootschap

________________________________

Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor (gebruik 
door) de geadresseerde. De e-mail kan persoonlijke of vertrouwelijke informatie 
bevatten. Openbaarmaking, vermenigvuldiging, verspreiding en/of verstrekking 
van (de inhoud van) deze e-mail (en eventuele bijlagen) aan derden is 
uitdrukkelijk niet toegestaan. Indien u niet de bedoelde geadresseerde bent, 
wordt u vriendelijk verzocht degene die de e-mail verzond hiervan direct op de 
hoogte te brengen en de e-mail (en eventuele bijlagen) te vernietigen.

Informatie vennootschap<http://www.ns.nl/emaildisclaimer>

RE: List/Fetch pattern for QueryDatabaseTable

Reply via email to