Hi Matt, Thanks for that very thorough explanation. I'll be sure to present this to my dev's in order to come to a way of working. When it comes to ListDatabaseTables, how can I use this when I don't want to query all tables from the database? Let's say there's some 30 tables in the database and I only want to query 8 of them. I realize there's a parameter called 'Table Name Pattern' in the processor, but how do I put together a pattern when the tables I'm targeting have very different names? Also, it seems a bit silly to open a database connections just to list the specific tables I'm looking to query (which I already know the names of). Is there a processor I can use where I can just fill out this list myself?
Perhaps these are very basic questions, by the way. Everybody here is learning how to work with NiFi and from an admin perspective I'm trying to provide some guiding on best practices for common flows. Kind regards, Walter -----Oorspronkelijk bericht----- Van: Matt Burgess [mailto:mattyb...@apache.org] Verzonden: donderdag 16 augustus 2018 15:55 Aan: users@nifi.apache.org Onderwerp: Re: List/Fetch pattern for QueryDatabaseTable Walter, If you're looking to distribute database fetching among a cluster, then GenerateTableFetch is the right choice (over QueryDatabaseTable). As of NiFi 1.2.0 (via NIFI-2881 [1]), GenerateTableFetch accepts incoming flow files, the capability was added in response to exactly the use case you outlined, distributed fetch of multiple tables via ListDatabaseTables. You still want your source processor to run on the Primary Node Only, otherwise all nodes get the same source data and as you said, you end up with duplicate data. QueryDatabaseTable does not accept incoming connections, but you can use ExecuteSQL to actually do the fetching. To distributed fetching of tables among the cluster, I recommend the following flow: ListDatabaseTables (on Primary Node only) -> RPG -> Input Port -> GenerateTableFetch -> ExecuteSQL Each node on the cluster will get flow files for the various tables in the database, then GenerateTableFetch will generate the SQL to fetch "pages" based on the Partition Size property, then ExecuteSQL will execute the statements. You can use multiple concurrent tasks for ExecuteSQL to perform the fetching concurrently for the SQL statements, and the RPG->Input port part will let you do the tables in parallel. If instead you want to fully distribute the SQL execution (fetching various pages from various tables), you could move the RPG->Input Port after GTF: ListDatabaseTables (on Primary Node only) -> GenerateTableFetch -> RPG -> Input Port -> ExecuteSQL In this flow, the Primary Node will do all the work of generating all the SQL for all the pages, then will distributed the SQL among the cluster. So each node may be grabbing different pages from the same table, etc. Depending on how much work it takes to generate the SQL, this may not be as performant as the first flow. Alternatively you can distribute the SQL generation and the execution: ListDatabaseTables (on Primary Node only) -> RPG -> Input Port -> GenerateTableFetch -> RPG -> Input Port -> ExecuteSQL This might be overkill but does "fully" parallelize the work. In addition, as mentioned, you can set multiple concurrent tasks for ExecuteSQL (but not GenerateTableFetch) to achieve concurrency for fetching. One thing to watch out for in all cases is the Max Connections property for the DBCPConnectionPool. Each node will get its own pool, but depending on how much is going through GTF and ExecuteSQL, you may run out of connections (which will slow your throughput) or if Max Connections is high, you may exhaust all connections at the server, just something to keep in mind when configuring the flow. Regards, Matt [1] https://issues.apache.org/jira/browse/NIFI-2881 On Thu, Aug 16, 2018 at 7:55 AM Vos, Walter <walter....@ns.nl> wrote: > > Hi, > > > > I’m trying to find a good strategy for distributing work among a cluster when > we’re fetching data from a database. My developers are currently doing > GenerateTableFetch and executing it only on the primary node because > “otherwise we end up with duplicate data”. A little googling on my end and I > found out about the List/Fetch pattern. All the examples are for SFTP though. > > > > I’m wondering what a good configuration might be if you’re looking to use > this pattern for fetching from a database. I’ve found GenerateTableFetch, and > I can certainly use this, but since we’re querying multiple tables (but not > all tables in the DB!) I’m hoping to use something like ListDatabaseTables > before that, so that GenerateTableFetch can be done on the whole cluster and > then QueryDatabaseTable as well. > > > > So one option is Multiple GenerateTableFetch processors > Funnel > RPG // > Input port > QueryDatabaseTable. I’m wondering if there’s also a good way to > go this route: ListDatabaseTables > RPG // Input port > GenerateTableFetch > > QueryDatabaseTable. I want to distribute as much work as possible within the > cluster. > > > > Kind regards, > > > > Walter > > > ________________________________ > > Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor > (gebruik door) de geadresseerde. De e-mail kan persoonlijke of vertrouwelijke > informatie bevatten. Openbaarmaking, vermenigvuldiging, verspreiding en/of > verstrekking van (de inhoud van) deze e-mail (en eventuele bijlagen) aan > derden is uitdrukkelijk niet toegestaan. Indien u niet de bedoelde > geadresseerde bent, wordt u vriendelijk verzocht degene die de e-mail verzond > hiervan direct op de hoogte te brengen en de e-mail (en eventuele bijlagen) > te vernietigen. > > Informatie vennootschap ________________________________ Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor (gebruik door) de geadresseerde. De e-mail kan persoonlijke of vertrouwelijke informatie bevatten. Openbaarmaking, vermenigvuldiging, verspreiding en/of verstrekking van (de inhoud van) deze e-mail (en eventuele bijlagen) aan derden is uitdrukkelijk niet toegestaan. Indien u niet de bedoelde geadresseerde bent, wordt u vriendelijk verzocht degene die de e-mail verzond hiervan direct op de hoogte te brengen en de e-mail (en eventuele bijlagen) te vernietigen. Informatie vennootschap<http://www.ns.nl/emaildisclaimer>