Re: Questions on reading JDBC data with Flink Streaming API

2021-08-10 Thread Fuyao Li
Hello James,

To stream real-time data out of the database, you need to spin up a CDC 
instance, for example Debezium [1]. The CDC engine streams the changed data out 
to Kafka (for example), and you can then consume the messages from Kafka using 
FlinkKafkaConsumer, as in the sketch below.
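
A minimal sketch of the consuming side, assuming Flink 1.13 with the 
flink-connector-kafka dependency on the classpath; the topic name, bootstrap 
servers, and group id are placeholders, and the Debezium JSON is read here as a 
raw string to be parsed in a downstream function:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class CdcKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "cdc-reader");              // placeholder

        // Debezium publishes one change event per record; the raw JSON is
        // consumed as a string and would be parsed in a later map function.
        FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>(
                "dbserver1.dbo.trades",   // hypothetical Debezium topic name
                new SimpleStringSchema(),
                props);

        DataStream<String> changeEvents = env.addSource(consumer);
        changeEvents.print();

        env.execute("Consume Debezium change events from Kafka");
    }
}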
For history data, the load can be treated as bounded data stream processing 
using the JDBC connector [2]; a sketch follows.
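
For the bounded read, one option with the DataStream API is JdbcInputFormat 
from flink-connector-jdbc together with env.createInput(). A sketch, again 
assuming Flink 1.13; the driver, URL, credentials, query, and column types are 
placeholders you would adapt to your schema:

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.connector.jdbc.JdbcInputFormat;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.types.Row;

public class JdbcHistoryJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Column types must match the SELECT list; two columns assumed here.
        RowTypeInfo rowTypeInfo = new RowTypeInfo(
                BasicTypeInfo.INT_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO);

        JdbcInputFormat inputFormat = JdbcInputFormat.buildJdbcInputFormat()
                .setDrivername("com.microsoft.sqlserver.jdbc.SQLServerDriver") // assumption
                .setDBUrl("jdbc:sqlserver://dbhost;databaseName=reporting")    // placeholder
                .setUsername("user")                                           // placeholder
                .setPassword("secret")                                         // placeholder
                .setQuery("SELECT id, name FROM trades")                       // hypothetical query
                .setRowTypeInfo(rowTypeInfo)
                .finish();

        // createInput produces a bounded stream that finishes once the
        // query's result set is exhausted.
        DataStream<Row> history = env.createInput(inputFormat);
        history.print();

        env.execute("Bounded JDBC read of history data");
    }
}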

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/debezium/
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/datastream/jdbc/

Thanks,
Fuyao

From: James Sandys-Lumsdaine 
Date: Tuesday, August 10, 2021 at 07:58
To: user@flink.apache.org 
Subject: [External] : Questions on reading JDBC data with Flink Streaming API

Hello,



I'm starting a new Flink application to allow my company to perform lots of 
reporting. We have an existing legacy system with most of the data we need held 
in SQL Server databases. We will need to consume data from these databases 
initially before starting to consume more data from newly deployed Kafka 
streams.



I've spent a lot of time reading the Flink book and web pages, but I have some 
simple questions and assumptions I hope you can help with so I can progress.



Firstly, I want to use the DataStream API so we can consume both historic and 
real-time data. I don't think I want to use the DataSet API, and I also don't 
see the point in using the SQL/Table APIs, as I would prefer to write my 
functions in Java classes. I need to maintain my own state, and it seems 
DataStream keyed functions are the way to go.



Now that I am trying to actually write code against our production databases, I 
need to be able to read in "streams" of data with SQL queries. There does not 
appear to be a JDBC source connector, so I think I have to make the JDBC call 
myself and then possibly create a DataSource using env.fromElements(). 
Obviously this is a "bounded" data set, but how else am I meant to get historic 
data loaded in? In the future I want to include a Kafka stream as well, which 
will only hold a few weeks' worth of data, so I imagine I will sometimes need 
to merge data from a SQL Server/Snowflake database with a live Kafka stream. 
What is the best practice for this? I don't see examples discussing it.



On retrieving data from a JDBC source, I have also seen some examples using a 
StreamTableEnvironment. Am I meant to use this instead to query data from a 
JDBC connection into my DataStream functions etc.? Again, I want to write my 
functions in Java, not Flink SQL. Is it best practice to use a 
StreamTableEnvironment to query JDBC data if I'm only using the DataStream 
API?



Thanks in advance - I'm sure I will have plenty more high-level questions like 
this.


