Hello James,
To stream real-time data out of the database, you need to spin up a CDC
instance, for example Debezium [1]. The CDC engine streams the changed data to
Kafka (for example), and you can consume the messages from Kafka using
FlinkKafkaConsumer.
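A minimal sketch of the consuming side (the topic name, broker address and
group id are placeholders I made up; each record is a Debezium change event in
JSON that you would parse downstream):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class CdcKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "cdc-reader");              // placeholder

        // Each message is a Debezium change event in JSON; parse it in a later map().
        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("server1.dbo.orders", new SimpleStringSchema(), props);
        consumer.setStartFromEarliest(); // replay the whole change log

        DataStream<String> changeEvents = env.addSource(consumer);
        changeEvents.print();

        env.execute("CDC consumer sketch");
    }
}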
Historical data can be treated as a bounded data stream and processed using
the JDBC connector [2].
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/debezium/
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/datastream/jdbc/
Thanks,
Fuyao
From: James Sandys-Lumsdaine
Date: Tuesday, August 10, 2021 at 07:58
To: user@flink.apache.org
Subject: [External] : Questions on reading JDBC data with Flink Streaming API
Hello,
I'm starting a new Flink application to allow my company to perform lots of
reporting. We have an existing legacy system with most of the data we need held
SQL Server databases. We will need to consume data from these databases
initially before starting to consume more data from newly deployed Kafka
streams.
I've spent a lot of time reading the Flink book and web pages, but I have some
simple questions and assumptions I hope you can help with so I can progress.
Firstly, I want to use the DataStream API so we can consume both historic and
real-time data. I do not think I want to use the DataSet API, and I also don't
see the point in using the SQL/Table APIs, as I would prefer to write my
functions in Java classes. I need to maintain my own state, and it seems
DataStream keyed functions are the way to go.
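For example, I am picturing keeping per-key state with something like this
sketch (the running-total logic and the (key, amount) input type are just an
example I invented):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keeps a running total per key; input is (key, amount) pairs.
public class RunningTotal extends KeyedProcessFunction<String, Tuple2<String, Double>, Double> {
    private transient ValueState<Double> total;

    @Override
    public void open(Configuration parameters) {
        total = getRuntimeContext().getState(
                new ValueStateDescriptor<>("total", Double.class));
    }

    @Override
    public void processElement(Tuple2<String, Double> value, Context ctx, Collector<Double> out)
            throws Exception {
        Double current = total.value();
        double updated = (current == null ? 0.0 : current) + value.f1;
        total.update(updated);
        out.collect(updated); // emit the new per-key total
    }
}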
Now that I am trying to write code against our production databases, I need to
be able to read in "streams" of data with SQL queries. There does not appear to
be a JDBC source connector, so I think I have to make the JDBC call myself and
then possibly create a DataSource using env.fromElements(). Obviously this is a
"bounded" data set, but how else am I meant to get historic data loaded in? In
the future I want to include a Kafka stream as well, which will only hold a few
weeks' worth of data, so I imagine I will sometimes need to merge data from a
SQL Server/Snowflake database with a live Kafka stream. What is the best
practice for this? I don't see examples discussing it.
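To make the question concrete, here is the kind of thing I was imagining (the
topic, broker address and string elements are placeholders I made up):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class HistoricPlusLive {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Historic data loaded up front; in reality this would come from the JDBC query.
        DataStream<String> historic = env.fromElements("trade-1", "trade-2"); // placeholders

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "reporting");               // placeholder
        DataStream<String> live = env.addSource(
                new FlinkKafkaConsumer<>("trades", new SimpleStringSchema(), props));

        // Both inputs must share a type before they can be unioned.
        DataStream<String> all = historic.union(live);
        all.print();

        env.execute("historic + live sketch");
    }
}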
On retrieving data from a JDBC source, I have also seen some examples using a
StreamTableEnvironment - am I meant to use this somehow instead to query data
from a JDBC connection into my DataStream functions etc.? Again, I want to
write my functions in Java, not Flink SQL. Is it best practice to use a
StreamTableEnvironment to query JDBC data if I'm only using the DataStream
API?
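The examples I saw looked roughly like the sketch below (connection details and
table name are placeholders; only the table definition is SQL, and everything
downstream would be a plain DataStream):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class JdbcViaTableApi {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // The SQL here only declares the source; processing stays in the DataStream API.
        tableEnv.executeSql(
                "CREATE TABLE trades (id INT, name STRING) WITH ("
                        + " 'connector' = 'jdbc',"
                        + " 'url' = 'jdbc:sqlserver://host:1433;databaseName=reporting'," // placeholder
                        + " 'table-name' = 'trades',"  // placeholder
                        + " 'username' = 'user',"      // placeholder
                        + " 'password' = 'secret'"     // placeholder
                        + ")");

        Table trades = tableEnv.sqlQuery("SELECT id, name FROM trades");
        DataStream<Row> stream = tableEnv.toDataStream(trades);
        stream.print();

        env.execute("JDBC via Table API sketch");
    }
}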
Thanks in advance - I'm sure I will have plenty more high-level questions like
this.