Anil,

When you say huge volumes, do you mean a large number of tables, or large 
tables, or both?

For a large number of tables, you will likely want to upgrade to the upcoming 
NiFi release so you can use ListDatabaseTables -> GenerateTableFetch -> 
ExecuteSQL for the source part, although in the meantime you could use 
ListDatabaseTables -> ReplaceText (to set the SELECT query) -> ExecuteSQL.
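
As a rough sketch of that ReplaceText step, assuming ListDatabaseTables writes 
the table name to a flow file attribute such as db.table.fullname (worth 
double-checking the attribute name against the docs), and with ReplaceText's 
Replacement Strategy set to "Always Replace", the Replacement Value could 
simply be:

    SELECT * FROM ${db.table.fullname}

ExecuteSQL should then pick the query up from the flow file content, provided 
its own "SQL select query" property is left blank.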

For large tables on a single NiFi instance, I recommend using 
QueryDatabaseTable. Whether or not you have a max-value column, QDT lets you 
fetch smaller batches of rows per flow file, whereas ExecuteSQL puts the 
whole result set into a single flow file, which can lead to memory issues.
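
As a rough illustration (table and column names here are made up), with a 
Maximum Value Column of "id", QDT effectively issues incremental queries like:

    SELECT * FROM my_table WHERE id > 12345

where 12345 is the largest "id" it saw on the previous run (kept in processor 
state), and the max-rows-per-flow-file setting splits that result set into 
several smaller Avro flow files rather than one big one.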

For scalability with large tables, I recommend a NiFi cluster of 3-10 nodes, 
using a flow of GenerateTableFetch -> RPG -> Input Port -> ExecuteSQL for the 
source part. In 1.2.0 you'll be able to put ListDatabaseTables at the front 
to support a large number of tables. The RPG -> Input Port hop distributes 
the flow files across the cluster, so the downstream flow runs in parallel, 
with each node handling a subset of the incoming flow files (rather than 
each copy of the flow getting every flow file).
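
To give a sense of what gets distributed (again with made-up table and column 
names, and the exact SQL depends on the database adapter you choose), 
GenerateTableFetch with a Partition Size of 10000 emits a series of flow 
files containing page queries along the lines of:

    SELECT * FROM my_table WHERE id <= 50000 ORDER BY id LIMIT 10000 OFFSET 0
    SELECT * FROM my_table WHERE id <= 50000 ORDER BY id LIMIT 10000 OFFSET 10000

so once the RPG -> Input Port hop spreads those across the nodes, each node's 
ExecuteSQL pulls a different slice of the table.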

Regards,
Matt


> On May 3, 2017, at 1:58 PM, Anil Rai <anilrain...@gmail.com> wrote:
> 
> Hi Matt,
> 
> I quickly developed this, and this is how I could do it:
> 
> DataLake<-ExecuteSQL->ConvertAvroToJSON->SplitJson->EvaluateJsonPath->ReplaceText->PutSQL->Postgres(onCloud)
> 
> The problem is, this will not scale for huge volumes. Any thoughts?
> 
> Regards
> Anil
> 
> 
>> On Tue, May 2, 2017 at 12:07 PM, Matt Burgess <mattyb...@apache.org> wrote:
>> 
>> Yes that sounds like your best bet, assuming you have the "Maximum
>> Value Column" present in the table you want to migrate.  Then a flow
>> might look like:
>> 
>> QueryDatabaseTable -> ConvertAvroToJSON -> ConvertJSONToSQL -> PutSQL
>> 
>> In this flow the target tables would need to be created beforehand.
>> You might be able to do that with pg_dump or with some fancy SQL that
>> you could send to PutSQL in a separate (do-ahead) flow [1].  For
>> multiple tables, you will need one QueryDatabaseTable for each table;
>> depending on the number of tables and the latency for getting/putting
>> rows, you may be able to share the downstream processing. If that
>> creates a bottleneck, you may want a copy of the above flow for each
>> table.  This is drastically improved in NiFi 1.2.0, as you can use
>> ListDatabaseTables -> GenerateTableFetch -> RPG -> Input Port ->
>> ExecuteSQL to perform the migration in parallel across a NiFi cluster.
>> 
>> Regards,
>> Matt
>> 
>> [1] https://serverfault.com/questions/231952/is-there-a-mysql-equivalent-of-show-create-table-in-postgres
>> 
>> 
>>> On Tue, May 2, 2017 at 11:18 AM, Anil Rai <anilrain...@gmail.com> wrote:
>>> Thanks Matt for the quick reply. We are using nifi 1.0 release as of now.
>>> It's a postgres DB on both sides (on prem and on cloud)
>>> and yes incremental load is what i am looking for.....
>>> so with that, you recommend # 2 option?
>>> 
>>> On Tue, May 2, 2017 at 11:00 AM, Matt Burgess <mattyb...@apache.org> wrote:
>>> 
>>>> Anil,
>>>> 
>>>> Is this a "one-time" migration, meaning you would take the on-prem
>>>> tables and put them on the cloud DB just once? Or would this be an
>>>> incremental operation, where you do the initial move and then take any
>>>> "new" rows from the source and apply them to the target?  For the
>>>> latter, there are a couple of options:
>>>> 
>>>> 1) Rebuild the cloud DB periodically. You can use ExecuteSQL ->
>>>> [processors] -> PutSQL after perhaps deleting your target
>>>> DB/tables/etc.  This could be time-consuming and expensive. The
>>>> processors in-between probably include ConvertAvroToJSON and
>>>> ConvertJSONToSQL.
>>>> 2) Use QueryDatabaseTable or (GenerateTableFetch -> ExecuteSQL) to get
>>>> the source data. For this your table would need a column whose values
>>>> always increase, that column would comprise the value of the "Maximum
>>>> Value Column" property in the aforementioned processors' configuration
>>>> dialogs. You would need one QueryDatabaseTable or GenerateTableFetch
>>>> for each table in your DB.
>>>> 
>>>> In addition to these current solutions, as of the upcoming NiFi 1.2.0
>>>> release, you have the following options:
>>>> 3) If the source database is MySQL, you can use the CaptureChangeMySQL
>>>> processor to get binary log events flowing through various processors
>>>> into PutDatabaseRecord to place them at the target. This pattern is
>>>> true Change Data Capture (CDC) versus the other two options above.
>>>> 4) Option #2 will be improved such that GenerateTableFetch will accept
>>>> incoming flow files, so you can use ListDatabaseTables ->
>>>> GenerateTableFetch -> ExecuteSQL to handle multiple tables with one
>>>> flow.
>>>> 
>>>> If this is a one-time migration, a data flow tool might not be the
>>>> best choice; you could consider something like Flyway [1] instead.
>>>> 
>>>> Regards,
>>>> Matt
>>>> 
>>>> [1] https://flywaydb.org/documentation/command/migrate
>>>> 
>>>> On Tue, May 2, 2017 at 10:41 AM, Anil Rai <anilrain...@gmail.com> wrote:
>>>>> I have a simple use case.
>>>>> 
>>>>> DB (On Premise) and DB (On Cloud).
>>>>> 
>>>>> I want to use nifi to extract data from on prem DB (huge volumes) and
>>>>> insert into the same table structure that is hosted on cloud.
>>>>> 
>>>>> I could use ExecuteSQL on both sides of the fence (to extract from on
>>>>> prem and insert onto cloud). What processors are needed in between (if
>>>>> at all)? As I am not doing any transformations at all....it is just an
>>>>> extract and load use case.
>>>> 
>> 
