Hi Marco,

The DataSet API will not run out of memory, as it spills to disk if the data doesn't fit anymore. Load is distributed by partitioning the data.
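Regarding the partitioned scan you mentioned: the DataSet JDBC input format supports that through a parameter values provider, which turns one parameterized query into many ranged queries that the parallel tasks execute independently. Here is a minimal, untested sketch of what that could look like. I'm assuming Flink 1.13's flink-connector-jdbc; the table, columns, id range, and credentials are made-up placeholders:

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.connector.jdbc.JdbcInputFormat;
import org.apache.flink.connector.jdbc.split.JdbcNumericBetweenParametersProvider;
import org.apache.flink.types.Row;

public class PartitionedJdbcRead {

    public static DataSet<Row> readFromPostgres(ExecutionEnvironment env) {
        // Result schema of the query below: (tag VARCHAR, value DOUBLE).
        RowTypeInfo rowType = new RowTypeInfo(
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.DOUBLE_TYPE_INFO);

        JdbcInputFormat input = JdbcInputFormat.buildJdbcInputFormat()
                .setDrivername("org.postgresql.Driver")
                .setDBUrl("jdbc:postgresql://dbhost:5432/mydb")  // placeholder
                .setUsername("flink")                            // placeholder
                .setPassword("secret")                           // placeholder
                // The two '?' placeholders are filled per input split, so
                // each parallel task scans only its own id range.
                .setQuery("SELECT tag, value FROM readings WHERE id BETWEEN ? AND ?")
                .setRowTypeInfo(rowType)
                .setParametersProvider(
                        // assumed id range 1..200M, split into 1M-row batches
                        new JdbcNumericBetweenParametersProvider(1L, 200_000_000L)
                                .ofBatchSize(1_000_000L))
                .finish();

        return env.createInput(input);
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        readFromPostgres(env).first(10).print();  // sanity check only
    }
}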
The best approach depends a bit on the use case. I would explore two major options:

a) Reading the data from Postgres using Flink's SQL JDBC connector [1]. 200 GB is not much data: a 1 GBit/s network link (~125 megabytes/second) needs roughly 30 minutes to transfer it.

b) Using the DataSet API together with the State Processor API. I would first try to see how much effort it is to read the data using the DataSet API (it could be less convenient than the Flink SQL JDBC connector); the sketch above shows one way to do the read, and a rough sketch of the bootstrap step follows below the quoted message.

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/jdbc/

On Wed, Jun 16, 2021 at 6:50 AM Marco Villalobos <mvillalo...@kineteque.com> wrote:

> I must bootstrap state from postgres (approximately 200 GB of data) and I
> notice that the state processor API requires the DataSet API in order to
> bootstrap state for the Stream API.
>
> I wish there was a way to use the SQL API and use a partitioned scan, but
> I don't know if that is even possible with the DataSet API.
>
> I never used the DataSet API, and I am unsure how it manages memory, or
> distributes load, when handling large state.
>
> Would it run out of memory if I map data from a JDBCInputFormat into a
> large DataSet and then use that to bootstrap state for my stream job?
>
> Any advice on how I should proceed with this would be greatly appreciated.
>
> Thank you.
>
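PS: since the bootstrap step is the core of your question, here is a rough, untested sketch of the write side with the State Processor API, reusing readFromPostgres from the sketch above. The operator uid, state name, column positions, max parallelism, and savepoint path are placeholders that have to match your streaming job; I'm again assuming Flink 1.13, so please double-check the class names against your version:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;
import org.apache.flink.types.Row;

public class BootstrapStateJob {

    // Writes one ValueState entry per key. The state descriptor must mirror
    // the one in your streaming job, or the job won't find the state.
    public static class TagStateBootstrapper
            extends KeyedStateBootstrapFunction<String, Row> {

        private transient ValueState<Double> lastValue;

        @Override
        public void open(Configuration parameters) {
            lastValue = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("lastValue", Types.DOUBLE));
        }

        @Override
        public void processElement(Row row, Context ctx) throws Exception {
            lastValue.update((Double) row.getField(1));
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The partitioned JDBC read from the earlier sketch.
        DataSet<Row> rows = PartitionedJdbcRead.readFromPostgres(env);

        BootstrapTransformation<Row> transformation = OperatorTransformation
                .bootstrapWith(rows)
                .keyBy(new KeySelector<Row, String>() {
                    @Override
                    public String getKey(Row row) {
                        return (String) row.getField(0);  // key by the tag column
                    }
                })
                .transform(new TagStateBootstrapper());

        Savepoint
                // Backend and max parallelism should match the streaming job;
                // RocksDB keeps the 200 GB off the heap while writing.
                .create(new EmbeddedRocksDBStateBackend(), 128)
                .withOperator("tag-enrichment", transformation)  // the operator's uid()
                .write("s3://my-bucket/savepoints/bootstrap");   // placeholder path

        env.execute("bootstrap state from postgres");
    }
}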