Re: Data Load

2017-05-02 Thread Matt Burgess
Anil,

Is this a "one-time" migration, meaning you would take the on-prem
tables and put them on the cloud DB just once? Or would this be an
incremental operation, where you do the initial move and then take any
"new" rows from the source and apply them to the target?  For the
latter, there are a couple of options:

1) Rebuild the cloud DB periodically. You can use ExecuteSQL ->
[processors] -> PutSQL after perhaps deleting your target
DB/tables/etc.  This could be time-consuming and expensive. The
processors in between would likely include ConvertAvroToJSON and
ConvertJSONToSQL.
2) Use QueryDatabaseTable or (GenerateTableFetch -> ExecuteSQL) to get
the source data. For this, your table would need a column whose values
always increase; that column becomes the value of the "Maximum Value
Column" property in those processors' configuration dialogs. You would
need one QueryDatabaseTable or GenerateTableFetch per table in your DB
(see the sketch below the list).

In addition to these current solutions, as of the upcoming NiFi 1.2.0
release, you have the following options:
3) If the source database is MySQL, you can use the CaptureChangeMySQL
processor to get binary log events flowing through various processors
into PutDatabaseRecord, which applies them at the target. This pattern
is true Change Data Capture (CDC), versus the other two options above.
4) Option #2 will be improved such that GenerateTableFetch will accept
incoming flow files, so you can use ListDatabaseTables ->
GenerateTableFetch -> ExecuteSQL to handle multiple tables with one
flow.
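
For illustration, option #2 boils down to the processor issuing an
incremental query shaped roughly like the one below (the table and column
names are made up; the processor builds and tracks this for you, keeping
the last observed maximum value in its state):

-- Hypothetical table "orders" with an ever-increasing column "order_id";
-- 1234 stands for the last maximum value the processor recorded.
SELECT * FROM orders WHERE order_id > 1234 ORDER BY order_id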

If this is a one-time migration, a data flow tool might not be the
best choice; you could consider something like Flyway [1] instead.

Regards,
Matt

[1] https://flywaydb.org/documentation/command/migrate

On Tue, May 2, 2017 at 10:41 AM, Anil Rai  wrote:
> I have a simple use case.
>
> DB (On Premise) and DB (On Cloud).
>
> I want to use NiFi to extract data from the on-prem DB (huge volumes) and
> insert it into the same table structure that is hosted on the cloud.
>
> I could use ExecuteSQL on both sides of the fence (to extract from on prem
> and insert onto cloud). What processors are needed in between (if at all)?
> As I am not doing any transformations at all, it is just an extract-and-load
> use case.


Re: Data Load

2017-05-02 Thread Anil Rai
Thanks Matt for the quick reply. We are using the NiFi 1.0 release as of now.
It's a Postgres DB on both sides (on prem and on cloud),
and yes, an incremental load is what I am looking for.
So with that, do you recommend option #2?



Re: Data Load

2017-05-02 Thread Matt Burgess
Yes, that sounds like your best bet, assuming you have a suitable "Maximum
Value Column" in the table you want to migrate.  Then a flow
might look like:

QueryDatabaseTable -> ConvertAvroToJSON -> ConvertJSONToSQL -> PutSQL

In this flow the target tables would need to be created beforehand.
You might be able to do that with pg_dump or with some fancy SQL that
you could send to PutSQL in a separate (do-ahead) flow [1].  For
multiple tables, you will need one QueryDatabaseTable for each table;
depending on the number of tables and the latency for getting/putting
rows, you may be able to share the downstream processing. If that
creates a bottleneck, you may want a copy of the above flow for each
table.  This is drastically improved in NiFi 1.2.0, as you can use
ListDatabaseTables -> GenerateTableFetch -> Remote Process Group (RPG) ->
Input Port -> ExecuteSQL to perform the migration in parallel across a
NiFi cluster.
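
On the "fancy SQL" in [1], a minimal (and deliberately simplified) sketch of
deriving a CREATE TABLE statement from the Postgres catalog, which you could
run against the source and feed to PutSQL on the target, might look like the
query below. It ignores lengths, defaults, keys, and constraints, and the
table name is made up:

-- Simplified DDL generation from information_schema (Postgres);
-- replace 'my_table' with the table you want to recreate on the target.
SELECT 'CREATE TABLE ' || table_name || ' ('
       || string_agg(column_name || ' ' || data_type
                     || CASE WHEN is_nullable = 'NO' THEN ' NOT NULL' ELSE '' END,
                     ', ' ORDER BY ordinal_position)
       || ');'
FROM information_schema.columns
WHERE table_name = 'my_table'
GROUP BY table_name;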

Regards,
Matt

[1] 
https://serverfault.com/questions/231952/is-there-a-mysql-equivalent-of-show-create-table-in-postgres




Re: Data Load

2017-05-03 Thread Anil Rai
Hi Matt,

I quickly developed this, and this is how I could do it:

DataLake <- ExecuteSQL -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath -> ReplaceText -> PutSQL -> Postgres (on cloud)

The problem is that this will not scale for huge volumes. Any thoughts?
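
Here the ReplaceText step builds one statement per split record, roughly along
these lines (made-up table, column, and attribute names), so PutSQL ends up
executing one INSERT per source row:

-- Hypothetical ReplaceText output for a single record; the ${...} values are
-- flow file attributes extracted by EvaluateJsonPath.
INSERT INTO orders (order_id, customer_id, amount)
VALUES ('${order_id}', '${customer_id}', '${amount}')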

Regards
Anil




Re: Data Load

2017-05-03 Thread Matt Burgess
Anil,

When you say huge volumes, do you mean a large number of tables, or large 
tables, or both?

For a large number of tables, you will likely want to upgrade to the upcoming 
NiFi release so you can use ListDatabaseTables -> GenerateTableFetch -> 
ExecuteSQL for the source part, although in the meantime you could use 
ListDatabaseTables -> ReplaceText (to set the SELECT query) -> ExecuteSQL.
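
For the ReplaceText variant, assuming ListDatabaseTables exposes the table
name as a flow file attribute (e.g. db.table.name), the replacement value
would just be a full-table query along these lines:

-- ReplaceText "Replacement Value" using NiFi Expression Language;
-- db.table.name is assumed to be the attribute ListDatabaseTables writes.
SELECT * FROM ${db.table.name}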

For large tables on a single NiFi instance, I recommend using
QueryDatabaseTable. Whether you have a max-value column or not, QDT lets you
fetch smaller batches of rows per flow file, whereas ExecuteSQL puts the whole
result set into a single flow file, which can lead to memory issues.

For scalability with large tables, I recommend a NiFi cluster of 3-10 nodes,
using a flow of GenerateTableFetch -> RPG -> Input Port -> ExecuteSQL for the
source part. In 1.2.0 you'll be able to put ListDatabaseTables at the front to
support a large number of tables. The RPG -> Input Port hop distributes the
flow files across the cluster; the downstream flow on each node then executes
in parallel on its subset of the incoming flow files (rather than each copy of
the flow getting every flow file).
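
To make the parallelism concrete, with the same made-up table and column as
before, GenerateTableFetch emits a series of page-sized queries roughly of
this shape, each in its own flow file, so different nodes can work on
different pages at the same time (the exact SQL depends on the database
adapter):

-- Hypothetical pages over table "orders" with max-value column "order_id"
-- and a partition size of 10000.
SELECT * FROM orders WHERE order_id <= 50000 ORDER BY order_id LIMIT 10000 OFFSET 0
SELECT * FROM orders WHERE order_id <= 50000 ORDER BY order_id LIMIT 10000 OFFSET 10000
SELECT * FROM orders WHERE order_id <= 50000 ORDER BY order_id LIMIT 10000 OFFSET 20000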

Regards,
Matt

