Hi Marco,

The DataSet API will not run out of memory, as it spills to disk if the data doesn't fit anymore. Load is distributed by partitioning the data.
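Regarding the partitioned scan you mentioned: the DataSet JDBC input format supports that through a parameter values provider, which turns one parameterized query into many ranged queries that the parallel tasks execute independently. Here is a minimal, untested sketch of what that could look like. I'm assuming Flink 1.13's flink-connector-jdbc; the table, columns, id range, and credentials are made-up placeholders:

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.connector.jdbc.JdbcInputFormat;
import org.apache.flink.connector.jdbc.split.JdbcNumericBetweenParametersProvider;
import org.apache.flink.types.Row;

public class PartitionedJdbcRead {

    public static DataSet<Row> readFromPostgres(ExecutionEnvironment env) {
        // Result schema of the query below: (tag VARCHAR, value DOUBLE).
        RowTypeInfo rowType = new RowTypeInfo(
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.DOUBLE_TYPE_INFO);

        JdbcInputFormat input = JdbcInputFormat.buildJdbcInputFormat()
                .setDrivername("org.postgresql.Driver")
                .setDBUrl("jdbc:postgresql://dbhost:5432/mydb")  // placeholder
                .setUsername("flink")                            // placeholder
                .setPassword("secret")                           // placeholder
                // The two '?' placeholders are filled per input split, so
                // each parallel task scans only its own id range.
                .setQuery("SELECT tag, value FROM readings WHERE id BETWEEN ? AND ?")
                .setRowTypeInfo(rowType)
                .setParametersProvider(
                        // assumed id range 1..200M, split into 1M-row batches
                        new JdbcNumericBetweenParametersProvider(1L, 200_000_000L)
                                .ofBatchSize(1_000_000L))
                .finish();

        return env.createInput(input);
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        readFromPostgres(env).first(10).print();  // sanity check only
    }
}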
The best approach depends a bit on the use case. I would explore two major options:

a) Reading the data from Postgres using Flink's SQL JDBC connector [1]. 200 GB is not much data: a 1 GBit/s network link (~125 megabytes/second) needs roughly 30 minutes to transfer it.

b) Using the DataSet API together with the State Processor API. I would first try to see how much effort it is to read the data using the DataSet API (it could be less convenient than the Flink SQL JDBC connector); the sketch above shows one way to do the read, and a rough sketch of the bootstrap step follows below the quoted message.

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/jdbc/

On Wed, Jun 16, 2021 at 6:50 AM Marco Villalobos <mvillalo...@kineteque.com> wrote:

> I must bootstrap state from postgres (approximately 200 GB of data) and I
> notice that the state processor API requires the DataSet API in order to
> bootstrap state for the Stream API.
>
> I wish there was a way to use the SQL API and use a partitioned scan, but
> I don't know if that is even possible with the DataSet API.
>
> I never used the DataSet API, and I am unsure how it manages memory, or
> distributes load, when handling large state.
>
> Would it run out of memory if I map data from a JDBCInputFormat into a
> large DataSet and then use that to bootstrap state for my stream job?
>
> Any advice on how I should proceed with this would be greatly appreciated.
>
> Thank you.
>
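PS: since the bootstrap step is the core of your question, here is a rough, untested sketch of the write side with the State Processor API, reusing readFromPostgres from the sketch above. The operator uid, state name, column positions, max parallelism, and savepoint path are placeholders that have to match your streaming job; I'm again assuming Flink 1.13, so please double-check the class names against your version:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;
import org.apache.flink.types.Row;

public class BootstrapStateJob {

    // Writes one ValueState entry per key. The state descriptor must mirror
    // the one in your streaming job, or the job won't find the state.
    public static class TagStateBootstrapper
            extends KeyedStateBootstrapFunction<String, Row> {

        private transient ValueState<Double> lastValue;

        @Override
        public void open(Configuration parameters) {
            lastValue = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("lastValue", Types.DOUBLE));
        }

        @Override
        public void processElement(Row row, Context ctx) throws Exception {
            lastValue.update((Double) row.getField(1));
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The partitioned JDBC read from the earlier sketch.
        DataSet<Row> rows = PartitionedJdbcRead.readFromPostgres(env);

        BootstrapTransformation<Row> transformation = OperatorTransformation
                .bootstrapWith(rows)
                .keyBy(new KeySelector<Row, String>() {
                    @Override
                    public String getKey(Row row) {
                        return (String) row.getField(0);  // key by the tag column
                    }
                })
                .transform(new TagStateBootstrapper());

        Savepoint
                // Backend and max parallelism should match the streaming job;
                // RocksDB keeps the 200 GB off the heap while writing.
                .create(new EmbeddedRocksDBStateBackend(), 128)
                .withOperator("tag-enrichment", transformation)  // the operator's uid()
                .write("s3://my-bucket/savepoints/bootstrap");   // placeholder path

        env.execute("bootstrap state from postgres");
    }
}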