Hi,

Thank you for your reply.
We are using the ExecuteSQLRecord processor with a DBCPConnectionPool to pull
data from an Oracle DB. The Oracle DB has terabytes of data that are not
spread uniformly over time, meaning some intervals (days/hours/minutes) have
much more data than others. That has made tuning the processor's parameters
challenging. Currently, Output Batch Size is set to 1000, Max Rows Per Flow
File to 100, and Fetch Size to 100.

The flow pulls one 6-minute interval at a time: for example, 00:00:00 to
00:06:00, then 00:06:00 to 00:12:00, and so on. Initially the bottleneck
appeared to be the pull itself, so we added more threads/concurrency and it
now pulls multiple intervals at once. However, this created a bottleneck
downstream, since the transformer processor could not keep up.

We are looking at fine-tuning backpressure and the distribution of threads,
and perhaps redesigning the flow. The ControlRate processor may be of help
as well.

Best,
Eric

On Wed, Apr 9, 2025 at 2:38 AM Etienne Jouvin <[email protected]> wrote:

> Hi,
>
> Hard to give advice for me with this description.
>
> But how do you pull information from your database ?
> I mean, did you take a look at the parameter Output Batch Size ?
>
> Do you need to have only one FlowFile for the executed query, or can you
> split results, like 1000 rows per FlowFile ?
>
> On Tue, Apr 8, 2025 at 16:30, Eric An <[email protected]> wrote:
>
>> Hi,
>>
>> We are on version 1.23.2 and have some questions surrounding the ETL data
>> pipeline.
>>
>> It connects to the Oracle DB to extract/pull data incrementally
>> (minutes/hours), does transformations, and loads to S3.
>>
>> However, we are seeing a bottleneck at Oracle/pulling the data, so when we
>> assign more threads to that stage it creates a bottleneck at the
>> transformation stage, since it monopolizes the threads.
>>
>> Is there a way to dynamically assign the threads?
>> The data in Oracle is not uniformly distributed, so some days/hours have
>> much more data. For those days/hours, having access to more threads helps
>> tremendously in extracting the data.
>>
>> Any advice/recommendations on how to approach performance tuning in this
>> scenario? Should we just divvy up the available threads between the
>> extract and transformation processors evenly? I'm not sure what would be
>> the best way to assign the number of threads to each processor to
>> maximize the throughput of the pipeline.
>>
>> Best,
>> Eric
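A fixed 6-minute window is exactly what lets skew hurt: a hot interval produces one enormous pull while quiet intervals waste a thread each. As a thought experiment for the "redesigning the flow" option, the sketch below sizes windows by estimated row count instead of wall-clock time, so hot intervals get many small pulls and quiet stretches stay as one. This is a minimal, hypothetical Python illustration of the partitioning logic, not NiFi configuration; `estimate_rows` is an assumed stand-in for something like a cheap `SELECT COUNT(*)` over the interval (which may itself be costly on terabyte tables).

```python
from datetime import datetime, timedelta

def plan_windows(start, end, estimate_rows, target_rows,
                 minimum=timedelta(seconds=30)):
    """Recursively split [start, end) into pull windows whose estimated
    row count stays at or below target_rows. The minimum width stops the
    recursion when a single instant is hotter than the target."""
    if estimate_rows(start, end) <= target_rows or (end - start) <= minimum:
        return [(start, end)]
    mid = start + (end - start) / 2
    return (plan_windows(start, mid, estimate_rows, target_rows, minimum)
            + plan_windows(mid, end, estimate_rows, target_rows, minimum))

if __name__ == "__main__":
    # Fake estimator for demonstration: a flat 1000 rows/minute,
    # with a target of at most 2000 rows per pull.
    t0 = datetime(2025, 4, 8)
    est = lambda lo, hi: int((hi - lo).total_seconds() / 60 * 1000)
    for lo, hi in plan_windows(t0, t0 + timedelta(minutes=8), est, 2000):
        print(lo.time(), "->", hi.time())  # four contiguous 2-minute windows
```

In a NiFi flow this planning step would live upstream of ExecuteSQLRecord (e.g. a processor emitting one FlowFile per window, with the start/end bounds as attributes fed into the query), so each pull carries a roughly uniform load regardless of how the data is skewed in time.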
