Hi Marco,

The DataSet API will not run out of memory: it spills to disk once the data
no longer fits, and load is distributed by partitioning the data across the
parallel instances.

Giving you advice depends a bit on the use case. I would explore two major
options:
a) reading the data from Postgres using Flink's SQL JDBC connector [1] (a
rough sketch follows below). 200 GB is not much data: a 1 Gbit/s network
link moves ~125 MB/s, so transferring it takes roughly 30 minutes.
b) using the DataSet API together with the State Processor API (see the
second sketch below). I would first check how much effort it is to read the
data with the DataSet API (it could be less convenient than the Flink SQL
JDBC connector).
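In case it helps, here is a rough, untested sketch of a) using the Table API
in batch mode. The table name, columns, and connection details are made up,
so adjust them to your schema; the scan.partition.* options split the read
into parallel range queries over a numeric column:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ReadFromPostgresWithSql {
  public static void main(String[] args) {
    // Batch mode: the Postgres table is a finite, bounded source.
    TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inBatchMode().build());

    // Placeholder schema and connection details; the scan.partition.*
    // options let Flink read the table with several parallel range queries.
    tEnv.executeSql(
        "CREATE TABLE sensor_state (" +
        "  id BIGINT," +
        "  sensor_id STRING," +
        "  last_value DOUBLE" +
        ") WITH (" +
        "  'connector' = 'jdbc'," +
        "  'url' = 'jdbc:postgresql://db-host:5432/mydb'," +
        "  'table-name' = 'sensor_state'," +
        "  'username' = 'flink'," +
        "  'password' = 'secret'," +
        "  'scan.partition.column' = 'id'," +
        "  'scan.partition.num' = '16'," +
        "  'scan.partition.lower-bound' = '0'," +
        "  'scan.partition.upper-bound' = '10000000'" +
        ")");

    // Quick sanity check that the connector and partitioning work.
    tEnv.executeSql("SELECT COUNT(*) FROM sensor_state").print();
  }
}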

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/jdbc/
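And here is a rough, untested sketch of b) with the DataSet API plus the
State Processor API (Flink 1.13-style APIs). JdbcInputFormat is the current
name of the JDBCInputFormat you mentioned; the table/column names, operator
uid, savepoint path, and the MemoryStateBackend (use whatever backend your
streaming job uses) are placeholders:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.jdbc.JdbcInputFormat;
import org.apache.flink.connector.jdbc.split.JdbcNumericBetweenParametersProvider;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;
import org.apache.flink.types.Row;

public class BootstrapStateFromPostgres {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Partitioned scan: Flink fills the BETWEEN ? AND ? placeholders per
    // split, so the 200 GB is read by several parallel source instances.
    JdbcInputFormat input = JdbcInputFormat.buildJdbcInputFormat()
        .setDrivername("org.postgresql.Driver")
        .setDBUrl("jdbc:postgresql://db-host:5432/mydb")
        .setQuery("SELECT id, sensor_id, last_value FROM sensor_state " +
                  "WHERE id BETWEEN ? AND ?")
        .setParametersProvider(
            new JdbcNumericBetweenParametersProvider(0L, 10_000_000L)
                .ofBatchNum(200))
        .setRowTypeInfo(new RowTypeInfo(
            BasicTypeInfo.LONG_TYPE_INFO,
            BasicTypeInfo.STRING_TYPE_INFO,
            BasicTypeInfo.DOUBLE_TYPE_INFO))
        .finish();

    DataSet<Row> rows = env.createInput(input);

    BootstrapTransformation<Row> transformation = OperatorTransformation
        .bootstrapWith(rows)
        .keyBy(new KeySelector<Row, String>() {
          @Override
          public String getKey(Row row) {
            return (String) row.getField(1); // sensor_id
          }
        })
        .transform(new LastValueBootstrapper());

    Savepoint.create(new MemoryStateBackend(), 128) // max parallelism
        .withOperator("sensor-state-uid", transformation) // must match the uid in the streaming job
        .write("s3://my-bucket/savepoints/bootstrap"); // placeholder path

    env.execute("bootstrap-state-from-postgres");
  }

  // Writes one ValueState entry per key; the descriptor must match the one
  // used by the streaming job that later restores from this savepoint.
  public static class LastValueBootstrapper
      extends KeyedStateBootstrapFunction<String, Row> {

    private transient ValueState<Double> lastValue;

    @Override
    public void open(Configuration parameters) {
      lastValue = getRuntimeContext().getState(
          new ValueStateDescriptor<>("lastValue", Types.DOUBLE));
    }

    @Override
    public void processElement(Row row, Context ctx) throws Exception {
      lastValue.update((Double) row.getField(2));
    }
  }
}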


On Wed, Jun 16, 2021 at 6:50 AM Marco Villalobos <mvillalo...@kineteque.com>
wrote:

> I must bootstrap state from postgres (approximately 200 GB of data) and I
> notice that the state processor API requires the DataSet API in order to
> bootstrap state for the Stream API.
>
> I wish there was a way to use the SQL API and use a partitioned scan, but
> I don't know if that is even possible with the DataSet API.
>
> I never used the DataSet API, and I am unsure how it manages memory, or
> distributes load, when handling large state.
>
> Would it run out of memory if I map data from a JDBCInputFormat into a
> large DataSet and then use that to bootstrap state for my stream job?
>
> Any advice on how I should proceed with this would be greatly appreciated.
>
> Thank you.
>
