Hi Zhihao,
Thanks for driving the discussion.
First, I have the same question as Kurt.

Secondly, seamless switching after a SQL upgrade is a complicated topic.
What problems is the self-built management system intended to solve?
Upgrading the SQL of a simple ETL job? Or upgrading the SQL of a complicated
stateful job, for example adding/removing an aggregate function,
adding/removing a group key, or changing a join or filter condition
(a sketch of such a change follows below)?
These questions involve:
1. If the topology changes, does the data need to be backfilled, and how?
2. If the topology has not changed but an operator has changed, does the
data need to be backfilled, does the state need to be recovered, and how?

Best,
JING ZHANG

Kurt Young <ykt...@gmail.com> wrote on Fri, Sep 3, 2021 at 9:15 AM:

>  Could you explain why you need a backfill after you take v2 into
> production?
>
> Best,
> Kurt
>
>
> On Fri, Sep 3, 2021 at 2:02 AM zhihao wang <h...@chopin.fm> wrote:
>
> > Hi team
> >
> > Graceful Application Evolvement is a hard and open problem for the
> > community. We have met this problem in our production, too. To address
> > it, we are planning to leverage a backfill-based approach with a
> > self-built management system. We'd like to hear the community's feedback
> > on this approach.
> >
> > Our job structure is like this: Kafka Inputs -> Flink SQL v1 -> Kafka
> > Output. We need to keep the Kafka Output interface / address unchanged
> > for clients. We perform a code change in three steps:
> >
> > 1. *Develop Step:* The client launches a new Flink job to update the
> > code from Flink SQL v1 to Flink SQL v2 with structure: Kafka Inputs ->
> > Flink SQL v2 -> TEMP Kafka. The new job will read the production inputs
> > and write into an auto-generated temporary Kafka topic so that there is
> > no pollution of the Flink SQL v1 job.
> >
> > 2. *Deploy Step*: When the client has tested thoroughly and thinks Flink
> > SQL v2 is ready to be promoted to production (the completeness is judged
> > manually by clients), the client can deploy the Flink SQL v2 logic to
> > production in one click. Behind the scenes, the system will automatically
> > do the following actions in sequence:
> >     2.1. The system will take a savepoint of the Flink SQL v2 job, which
> > contains all its internal state.
> >     2.2. The system will stop the Flink SQL v2 job and the Flink SQL v1
> > job.
> >     2.3. The system will create a new production-ready job with structure
> > Kafka Inputs -> Flink SQL v2 -> Kafka Output from 2.1's savepoint.
> >
> > 3. *Backfill Step*: After the Deploy Step is done, Flink SQL v2 is
> > already in production and serves the latest traffic. It’s at the
> > client’s discretion when and how fast to perform a backfill to correct
> > all the records.
> >
> >     3.1. Here we need a special form of backfill: for the Flink job,
> > given one key in the schema of <Kafka Output>, the backfill will 1) send
> > a Refresh Record, e.g. UPSERT <key, latest value>, to clients if the key
> > exists in Flink state, or 2) send a Delete Record, e.g. DELETE <key,
> > null>, to clients if the key doesn't exist in Flink state.
> >     3.2. The system will backfill all the records of the two sinks from
> > the Deploy Step, <Kafka Output> and <TEMP Kafka>. The backfill will
> > either refresh clients’ records or clean up clients’ stale records.
> >     3.3. After the backfill is completed, <TEMP Kafka> will be destroyed
> > automatically by the system.
> >
> > Please let us know your opinions on this approach.
> >
> > Regards
> > Zhihao
> >
>
