Re: State bootstrapping for Flink SQL / Table API jobs

2023-04-26 Thread Flavio Pompermaier
This feature would be an awesome addition! I'm looking forward to it On Mon, Apr 24, 2023 at 3:59 PM Илья Соин wrote: > Thank you, Shammon FY > > -- > *Sincerely,* > *Ilya Soin* > > On 24 Apr 2023, at 15:19, Shammon FY wrote: > >  > Thanks Илья, there's already a FLIP [1] and discussion thread

Re: State bootstrapping for Flink SQL / Table API jobs

2023-04-24 Thread Илья Соин
Thank you, Shammon FY-- Sincerely,Ilya SoinOn 24 Apr 2023, at 15:19, Shammon FY wrote:Thanks Илья, there's already a FLIP [1] and discussion thread [2] about hybrid source. You can follow the progress and welcome to participate in the discussion.[1] https://cwiki.apache.org/confluence/pages/viewp

Re: State bootstrapping for Flink SQL / Table API jobs

2023-04-24 Thread Shammon FY
Thanks Илья, there's already a FLIP [1] and discussion thread [2] about hybrid source. You can follow the progress and welcome to participate in the discussion. [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=235836225 [2] https://lists.apache.org/thread/nbf3skopy3trtj37jcovmt

Re: State bootstrapping for Flink SQL / Table API jobs

2023-04-24 Thread Илья Соин
Hi Shammon FY,I haven’t tried it because AFIK it’s only available in the DataStream API, while our job is in SQL. I’m thinking to write a custom HybridDynamicTableSource which will use HybridSource under the hood. This should allow to bootstrap any SQL / Table API job. Maybe it’s something worth ad

Re: State bootstrapping for Flink SQL / Table API jobs

2023-04-23 Thread Shammon FY
Hi Илья I think HybridSource may be a good way. Have you tried it before? Or have you encountered any problems? Best, Shammon FY On Fri, Apr 21, 2023 at 5:59 PM Илья Соин wrote: > Hi Flink community, > > We have a quite complex sql job, it unions 5 topics, deduplicates by key > and does some d

State bootstrapping for Flink SQL / Table API jobs

2023-04-21 Thread Илья Соин
Hi Flink community, We have a quite complex sql job, it unions 5 topics, deduplicates by key and does some daily aggregations. The state TTL is 40 days. We want to be able to bootstrap its state from s3 or clickhouse. We want to have a general solution to this, to use for other SQL jobs as well

Re: Bootstrapping multiple state within same operator

2023-03-24 Thread David Artiga
FFR] = null > > > > override def snapshotState(context: FunctionSnapshotContext): Unit = { > > } > > > > override def initializeState(context: FunctionInitializationContext): > Unit = { > > val fFRStateDescriptor = new ListStateDescriptor[FFR](&

RE: Bootstrapping multiple state within same operator

2023-03-24 Thread Schwalbe Matthias
user@flink.apache.org Subject: Re: Bootstrapping multiple state within same operator ⚠EXTERNAL MESSAGE – CAUTION: Think Before You Click ⚠ Not familiar with the implementation but thinking some options: - composable transformations - underlying MultiMap - ... On Wed, Mar 22, 2023

Re: Bootstrapping multiple state within same operator

2023-03-22 Thread David Artiga
Not familiar with the implementation but thinking some options: - composable transformations - underlying MultiMap - ... On Wed, Mar 22, 2023 at 10:24 AM Hang Ruan wrote: > Hi, David, > I also read the code about the `SavepointWriter#withOperator`. The > transformations are stored in a `Map` wh

Re: Bootstrapping multiple state within same operator

2023-03-22 Thread Hang Ruan
Hi, David, I also read the code about the `SavepointWriter#withOperator`. The transformations are stored in a `Map` whose key is `OperatorID`. I don't come up with a way that we could register multi transformations for one operator with the provided API. Maybe we need a new type of `XXXStateBoots

Bootstrapping multiple state within same operator

2023-03-22 Thread David Artiga
We are using state processor API to bootstrap the state of some operators. It has been working fine until now, when

Re: Please advise bootstrapping large state

2021-06-18 Thread Marco Villalobos
It was not clear to me that JdbcInputFormat was part of the DataSet api. Now I understand. Thank you. On Fri, Jun 18, 2021 at 5:23 AM Timo Walther wrote: > Hi Marco, > > as Robert already mentioned, the BatchTableEnvironment is simply build > on top of the DataSet API, partitioning functionali

Re: Please advise bootstrapping large state

2021-06-18 Thread Timo Walther
Hi Marco, as Robert already mentioned, the BatchTableEnvironment is simply build on top of the DataSet API, partitioning functionality is also available in DataSet API. So using the JdbcInputFormat directly should work in DataSet API. Otherwise I would recommend to use some initial pipeline

Re: Please advise bootstrapping large state

2021-06-17 Thread Marco Villalobos
I need to bootstrap a keyed process function. So, I was hoping to use the Table SQL API because I thought it could parallelize the work more efficiently via partitioning. I need to boot strap keyed state for a keyed process function, with Flnk 1.12.1, thus I think I am required to use the DataSet

Re: Please advise bootstrapping large state

2021-06-17 Thread Timo Walther
Hi Marco, which operations do you want to execute in the bootstrap pipeline? Maybe you don't need to use SQL and old planner. At least this would simplify the friction by going through another API layer. The JDBC connector can be directly be used in DataSet API as well. Regards, Timo On 1

Re: Please advise bootstrapping large state

2021-06-16 Thread Marco Villalobos
Thank you very much! I tried using Flink's SQL JDBC connector, and ran into issues. According to the flink documentation, only the old planner is compatible with the DataSet API. When I connect to the table: CREATE TABLE my_table ( ) WITH ( 'connector.type' = 'jdbc', 'connector.url'

Re: Please advise bootstrapping large state

2021-06-16 Thread Robert Metzger
Hi Marco, The DataSet API will not run out of memory, as it spills to disk if the data doesn't fit anymore. Load is distributed by partitioning data. Giving you advice depends a bit on the use-case. I would explore two major options: a) reading the data from postgres using Flink's SQL JDBC connec

Please advise bootstrapping large state

2021-06-15 Thread Marco Villalobos
I must bootstrap state from postgres (approximately 200 GB of data) and I notice that the state processor API requires the DataSet API in order to bootstrap state for the Stream API. I wish there was a way to use the SQL API and use a partitioned scan, but I don't know if that is even possible wit

Re: Re: Bootstrapping the state

2018-07-23 Thread Fabian Hueske
Hi Henkka, You might want to consider implementing a dedicated job for state bootstrapping that uses the same operator UUID and state names. That might be easier than integrating the logic into your regular job. I think you have to use the monitoring file source because AFAIK it won'

Re: Re: Bootstrapping the state

2018-07-22 Thread Henri Heiskanen
Hi, With state bootstrapping I mean loading the state with initial data before starting the actual job. For example, in our case I would like to load information like registration date of our users (>5 years of data) so that I can enrich our event data in streaming (5 days retention). Bef

Re: Re: Bootstrapping the state

2018-07-20 Thread Vino yang
Hi Henkka, If you want to customize the datastream text source for your purpose. You can use a read counter, if the value of counter would not change in a interval you can guess all the data has been read. Just a idea, you can choose other solution. About creating a savepoint automatically on jo

Re: Bootstrapping the state

2018-07-20 Thread Henri Heiskanen
Hi, Thanks. Just to clarify, where would you then invoke the savepoint creation? I basically need to know when all data is read, create a savepoint and then exit. I think I could just as well use the PROCESS_CONTINUOSLY, monitor the stream outside and then use rest api to cancel with savepoint. A

Re: Bootstrapping the state

2018-07-19 Thread vino yang
Hi Henkka, The behavior of the text file source meets expectation. Flink will not keep your source task thread when it exit from it's invoke method. That means you should keep your source task alive. So to implement this, you should customize a text file source (implement SourceFunction interface)

Bootstrapping the state

2018-07-19 Thread Henri Heiskanen
Hi, I've been looking into how to initialise large state and especially checked this presentation by Lyft referenced in this group as well: https://www.youtube.com/watch?v=WdMcyN5QZZQ In our use case we would like to load roughly 4 billion entries into this state and I believe loading this data f

Re: Bootstrapping

2018-01-26 Thread Aljoscha Krettek
Hi, I see this coming up more and more often these days. For now, the solution of doing a savepoint and switching sources should work but I've had it in my head for a while now to add functionality for bootstrapping inputs in the API. An operator would read from the bootstrap stream (whi

Re: Bootstrapping

2018-01-25 Thread Chen Qin
Hi Gregory, I have similar issue when dealing with historical data. We choose Lambda and figure out use case specific hand off protocol. Unless storage side can support replay logs within a time range, Streaming application authors still needs to carry extra work to implement batching layer

Bootstrapping

2018-01-25 Thread Gregory Fee
Hi group, I want to bootstrap some aggregates based on historic data in S3 and then keep them updated based on a stream. To do this I was thinking of doing something like processing all of the historic data, doing a save point, then restoring my program from that save point but with a stream source