[ https://issues.apache.org/jira/browse/NIFI-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16190648#comment-16190648 ]

Peter Wicks commented on NIFI-4385:
-----------------------------------

[~tspaeth] I've thought off and on for a while about adding this feature. Kind 
of like a 'dirty FlowFile' read, but there are issues that need to be tackled.

One of the reasons I haven't done it before is that if your connection to the 
database gets interrupted, or NiFi restarts, the rows you already sent out 
will not have been accounted for in QueryDatabaseTable's state. The obvious fix 
is to update QueryDatabaseTable's state with the latest max values right after 
each file is sent, reducing the chance of duplicate rows; but...
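To make that concrete, here is a hedged sketch (not actual NiFi code): checkpointing the observed max value after each emitted batch means a restart re-sends at most the in-flight batch instead of everything since the last full query. `StateStore` is a stand-in for the processor's persisted state map, and `emitBatches` is a hypothetical helper, not a real method.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointSketch {
    /** Stand-in for the processor's persisted state (e.g. a state manager). */
    static class StateStore {
        long maxValue = Long.MIN_VALUE;
    }

    /**
     * Emits rows in batches of batchSize, updating the checkpoint after each
     * batch. Rows are assumed ordered by the max-value column (see the ORDER BY
     * discussion): only then is "last row of the batch" a valid checkpoint.
     */
    static List<List<Long>> emitBatches(List<Long> orderedRows, int batchSize,
                                        StateStore state) {
        List<List<Long>> emitted = new ArrayList<>();
        List<Long> batch = new ArrayList<>();
        for (long row : orderedRows) {
            if (row <= state.maxValue) continue; // already accounted for
            batch.add(row);
            if (batch.size() == batchSize) {
                emitted.add(batch);
                state.maxValue = batch.get(batch.size() - 1); // per-batch checkpoint
                batch = new ArrayList<>();
            }
        }
        if (!batch.isEmpty()) {
            emitted.add(batch);
            state.maxValue = batch.get(batch.size() - 1);
        }
        return emitted;
    }
}
```

Rerunning emitBatches with the same StateStore after a crash skips everything already checkpointed, so duplicates are limited to at most the batch that was in flight.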

Right now QueryDatabaseTable does not put an ORDER BY on the SELECT query, 
which means the max value observed after any given FlowFile has no bearing on 
the ResultSet as a whole. That makes sense: why add the performance overhead of 
an ORDER BY if you don't need it? So it's important to understand that this 
change would also add some performance overhead from the ORDER BY. But...
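A small sketch of why the ordering matters (illustrative only; the row values are made up): on an unordered ResultSet, the last row of a FlowFile is not the max of everything sent so far, so checkpointing it and resuming with WHERE col > checkpoint can silently skip unsent rows.

```java
import java.util.List;

public class OrderingSketch {
    /**
     * Suppose a checkpoint is taken as "value of the last row in the batch".
     * On resume the query becomes WHERE col > checkpoint, so any remaining
     * (never-sent) row <= checkpoint is silently lost. Returns how many
     * rows would be lost that way.
     */
    static long rowsLostOnResume(List<Long> batchSent, List<Long> remaining) {
        long checkpoint = batchSent.get(batchSent.size() - 1);
        return remaining.stream().filter(v -> v <= checkpoint).count();
    }
}
```

With unordered rows [9, 2, 5] sent and [3, 7] still unsent, the checkpoint is 5 and row 3 is lost; with the same rows ordered, nothing is lost.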

Then there is the duplicate max value issue to deal with. There will be 
problems if your max value column is not a sequential ID number but a value 
where duplicates can occur; specifically I'm thinking about timestamps. Let's 
say each FlowFile is released based on the existing *Max Rows Per Flow File* 
property, so we output a FlowFile every 10,000 rows, but two rows with the same 
max value end up split across FlowFiles... and then NiFi gets restarted. Now 
there is no correct max value: some records with the current max value have 
been loaded and some haven't.

This problem can also be handled, but it requires continuing to read rows from 
the ResultSet until the max value column increases. There would then be no real 
way to guarantee how many rows end up in a FlowFile, be it 10 or 1M. I was 
recently working on some SQL for a NiFi change data capture job and found that 
every few days a batch job would run a single transaction updating 100k rows in 
my table, all with the same update timestamp...
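The "keep reading until the max value increases" approach can be sketched like this (my own illustration, assuming rows arrive ordered by the max-value column; the configured batch size becomes a floor, not a guarantee):

```java
import java.util.ArrayList;
import java.util.List;

public class ValueBoundaryBatcher {
    /**
     * Cuts batches of at least minRows, but only at points where the
     * max-value column actually changes, so no single value ever straddles
     * two batches. Input must be ordered by value. A run of duplicates
     * (e.g. 100k rows with one timestamp) all lands in one batch.
     */
    static List<List<Long>> batch(List<Long> orderedValues, int minRows) {
        List<List<Long>> batches = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (int i = 0; i < orderedValues.size(); i++) {
            current.add(orderedValues.get(i));
            boolean sizeReached = current.size() >= minRows;
            boolean valueChanges = i + 1 < orderedValues.size()
                    && !orderedValues.get(i + 1).equals(orderedValues.get(i));
            if (sizeReached && valueChanges) {
                batches.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) batches.add(current);
        return batches;
    }
}
```

For values [1, 1, 2, 2, 2, 2, 3] with minRows 2, the cuts land at the value boundaries: batches of 2, 4, and 1 rows, never splitting the run of 2s.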

Thoughts?

> Adjust the QueryDatabaseTable processor for handling big tables.
> ----------------------------------------------------------------
>
>                 Key: NIFI-4385
>                 URL: https://issues.apache.org/jira/browse/NIFI-4385
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>    Affects Versions: 1.3.0
>            Reporter: Tim Späth
>
> When querying large database tables, the *QueryDatabaseTable* processor does 
> not perform very well.
> The processor always performs the full query and then transfers all 
> FlowFiles as a list, instead of transferring them individually as the 
> *ResultSet* fetches the next rows (if a fetch size is given).
> If you want to query a billion rows from a table, the processor will 
> collect all FlowFiles in an ArrayList<FlowFile> in memory before 
> transferring the whole list after the last row is fetched by the ResultSet.
> I've checked the code in 
> *org.apache.nifi.processors.standard.QueryDatabaseTable.java* 
> and in my opinion it would be no big deal to move the session.transfer to 
> the proper position in the code (into the while loop where the FlowFile is 
> added to the list) to achieve real _stream support_.
> There was also a bug report for this problem which resulted in the new 
> property *Maximum Number of Fragments*, but that property just limits the 
> results.
> Now you have to multiply *Max Rows Per Flow File* by *Maximum Number of 
> Fragments* to get your limit, which is not really a solution for the 
> original problem imho.
> Also, the workaround with the GenerateTableFetch and/or ExecuteSQL 
> processors is much slower than using a database cursor or ResultSet and 
> streaming the rows into FlowFiles directly in the queue.
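The streaming change the reporter describes can be sketched like this (a simplified stand-in, not actual NiFi code; `Session` here only mimics the session.transfer call, and the ResultSet is reduced to an iterator of rows):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class StreamingTransferSketch {
    /** Minimal stand-in for a process session that just records transfers. */
    static class Session {
        final List<List<String>> transferred = new ArrayList<>();
        void transfer(List<String> flowFile) { transferred.add(flowFile); }
    }

    /**
     * Instead of collecting every FlowFile in one big list and transferring
     * them all at the end, transfer each FlowFile inside the fetch loop as
     * soon as it reaches maxRowsPerFlowFile rows, so memory stays bounded
     * by a single FlowFile's rows rather than the whole ResultSet.
     */
    static void stream(Iterator<String> resultSet, int maxRowsPerFlowFile,
                       Session session) {
        List<String> rows = new ArrayList<>();
        while (resultSet.hasNext()) {
            rows.add(resultSet.next());
            if (rows.size() == maxRowsPerFlowFile) {
                session.transfer(rows);   // transfer inside the loop
                rows = new ArrayList<>(); // only one FlowFile's rows held
            }
        }
        if (!rows.isEmpty()) session.transfer(rows); // trailing partial FlowFile
    }
}
```

With 10 rows and a limit of 3, four FlowFiles go out as the rows are read, the last holding the single leftover row.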



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
