Re: Re: Add control mode for flink

2021-06-08 Thread Steven Wu
Option 2 is probably not feasible, as a checkpoint may take a long time or
may fail.

Option 1 might work, although it complicates job recovery and
checkpointing. After a checkpoint completes, we need to clean up the control
signals stored in the HA service.
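Here is a minimal Python sketch that makes Option 1 and its cleanup step concrete. It is for illustration only. It assumes every accepted signal gets a monotonically increasing sequence number, and that each completed checkpoint records the highest sequence number it contains. `HaSignalStore`, `on_checkpoint_complete` and `on_failover` are hypothetical names, not Flink APIs, and the store is an in-memory stand-in for HighAvailabilityServices.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ControlSignal:
    seq: int        # monotonically increasing ID, assigned when the signal is accepted
    payload: dict   # e.g. {"source.qps": 500} (hypothetical key)


class HaSignalStore:
    """Hypothetical stand-in for a durable store in HighAvailabilityServices.
    Here it is an in-memory dict; a real store would survive JobManager loss."""

    def __init__(self) -> None:
        self._signals: Dict[int, ControlSignal] = {}
        self._next_seq = 1

    def persist(self, payload: dict) -> ControlSignal:
        # Persist BEFORE acknowledging the REST request, so a JobManager crash
        # after the acknowledgement cannot lose the signal.
        signal = ControlSignal(self._next_seq, payload)
        self._signals[signal.seq] = signal
        self._next_seq += 1
        return signal

    def signals_after(self, seq: int) -> List[ControlSignal]:
        # Signals that a checkpoint covering everything up to `seq` does not contain.
        return [self._signals[k] for k in sorted(self._signals) if k > seq]

    def prune_up_to(self, seq: int) -> None:
        # Every signal with an ID <= `seq` is now part of checkpointed state.
        for k in [k for k in self._signals if k <= seq]:
            del self._signals[k]


def on_checkpoint_complete(store: HaSignalStore, last_seq_in_checkpoint: int) -> None:
    # The cleanup step: drop signals that the completed checkpoint already covers.
    store.prune_up_to(last_seq_in_checkpoint)


def on_failover(store: HaSignalStore, last_seq_in_restored_checkpoint: int) -> List[ControlSignal]:
    # Replay only the signals that are missing from the restored checkpoint.
    return store.signals_after(last_seq_in_restored_checkpoint)
```

The design point is the ordering: persist first, acknowledge second, and prune only after the checkpoint covering the signal completes. How a checkpoint learns which sequence numbers it covers is deliberately left open here.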

Re: Re: Add control mode for flink

2021-06-08 Thread 刘建刚
Thanks for the reply. It is a good question. There are multiple options:

   1. We can persist control signals in HighAvailabilityServices and replay
   them after failover.
   2. We tell users that control signals only take effect after they
   are checkpointed.
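Option 2 can be sketched as an operator that stages incoming signals and applies them only from a checkpoint-complete callback. This is a hypothetical Python model. The method names loosely echo Flink's `snapshotState` and `CheckpointListener.notifyCheckpointComplete` hooks, but they are not Flink APIs. Signals from a failed checkpoint stay pending and are carried into the next snapshot, so a failure delays a signal rather than dropping it.

```python
from typing import Dict, List


class CheckpointGatedConfig:
    """A control signal becomes effective only after a checkpoint that
    contains it has completed (hypothetical model, not Flink API)."""

    def __init__(self, initial: dict) -> None:
        self.effective = dict(initial)              # config the operator actually uses
        self._staged: List[dict] = []               # received since the last snapshot
        self._pending: Dict[int, List[dict]] = {}   # checkpoint ID -> its signals

    def on_control_signal(self, update: dict) -> None:
        self._staged.append(update)

    def snapshot(self, checkpoint_id: int) -> List[dict]:
        # Staged signals belong to this checkpoint. The returned state holds
        # every not-yet-applied signal, so an earlier failed checkpoint does
        # not lose its signals.
        self._pending[checkpoint_id] = self._staged
        self._staged = []
        return [u for cid in sorted(self._pending) for u in self._pending[cid]]

    def notify_checkpoint_complete(self, checkpoint_id: int) -> None:
        # Apply, in checkpoint order, all signals from checkpoints <= checkpoint_id.
        for cid in sorted(c for c in self._pending if c <= checkpoint_id):
            for update in self._pending.pop(cid):
                self.effective.update(update)
```

The sketch also exposes the cost of this option: the delay before a signal takes effect equals the time until the next checkpoint completes.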


Re: Re: Add control mode for flink

2021-06-08 Thread Steven Wu
I can see the benefits of control flow. For example, it might help the old (and
inactive) FLIP-17 side input. I would suggest that we add more details on
some of the potential use cases.

Here is one mismatch with using control flow for dynamic config: a dynamic
config is typically targeted at and loaded by one specific operator, whereas
control flow will propagate the dynamic config to all operators. This is not a
problem per se.

Regarding using the REST API (to the JobManager) for accepting control
signals from external systems, where are we going to persist/checkpoint the
signal? The JobManager can die before the control signal is propagated and
checkpointed. Would we lose the control signal in this case?
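One way to reconcile broadcast control flow with operator-targeted config is to let each event carry an optional target. Every operator then forwards the event but applies it only when it is the one addressed. The sketch below assumes a simple chain of operators and uses hypothetical names (`ConfigEvent`, `Operator`, `target`).

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ConfigEvent:
    key: str
    value: Any
    target: Optional[str] = None   # operator ID; None means "all operators"


class Operator:
    def __init__(self, op_id: str, downstream: Optional[List["Operator"]] = None) -> None:
        self.op_id = op_id
        self.downstream = downstream or []
        self.config: dict = {}

    def on_control_event(self, event: ConfigEvent) -> None:
        # Apply only if this operator is addressed (or the event is untargeted) ...
        if event.target is None or event.target == self.op_id:
            self.config[event.key] = event.value
        # ... but always forward, so matching operators further downstream see it.
        for op in self.downstream:
            op.on_control_event(event)
```

Forwarding to every operator costs one event per channel, which is the overhead that is "not a problem per se". A framework could also stop forwarding once no downstream operator can match the target.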


Re: Re: Add control mode for flink

2021-06-08 Thread Xintong Song
+1 on separating the effort into two steps:

   1. Introduce a common control flow framework, with flexible interfaces
   for generating / reacting to control messages for various purposes.
   2. Features that leverage the control flow can then be worked on
   concurrently.

Meanwhile, continuing to collect potential features that may leverage the
control flow should be helpful. It provides good input for the control
flow framework design, making the framework general enough to cover the
potential use cases.

My suggestions on the next steps:

   1. Allow more time for opinions to be heard and potential use cases to
   be collected.
   2. Draft a FLIP scoped to the common control flow framework.
   3. We probably need a PoC implementation to make sure the framework
   covers at least the following scenarios:
      1. Produce control events from arbitrary operators
      2. Produce control events from the JobMaster
      3. Consume control events in arbitrary operators downstream of where
      the events are produced
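The three PoC scenarios could be prototyped with a small in-memory model before touching the runtime. This sketch assumes a tree-shaped job graph and uses hypothetical names (`Node`, `JobMaster.emit`, listener callbacks). Any operator or the JobMaster can emit an event, and listeners in downstream operators react to it.

```python
from typing import Callable, List


class Node:
    """An operator in a hypothetical job graph that can emit and consume control events."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.downstream: List["Node"] = []
        self.listeners: List[Callable[[str, dict], None]] = []

    def connect(self, other: "Node") -> "Node":
        self.downstream.append(other)
        return other

    def emit(self, event: dict) -> None:
        # Scenario 1: an arbitrary operator produces an event for its downstream.
        for node in self.downstream:
            node.receive(self.name, event)

    def receive(self, origin: str, event: dict) -> None:
        # Scenario 3: operators downstream of the producer react, then forward.
        # Assumes a tree: with fan-in, a node would get one copy per input.
        for listener in self.listeners:
            listener(origin, event)
        for node in self.downstream:
            node.receive(origin, event)


class JobMaster:
    def __init__(self, sources: List[Node]) -> None:
        self.sources = sources

    def emit(self, event: dict) -> None:
        # Scenario 2: the JobMaster injects the event at every source.
        for source in self.sources:
            source.receive("JobMaster", event)
```

With fan-in, a real framework would need to align or deduplicate the copies arriving on different inputs, similar to checkpoint barrier alignment.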


Thank you~

Xintong Song



Re: Re: Add control mode for flink

2021-06-07 Thread Yun Gao
Many thanks, Jiangang, for bringing this up, and many thanks to everyone for the discussion!

I also agree with the summary by Xintong and Jing that control flow seems to be
a common building block for many functionalities, and that the dynamic configuration
framework is a representative application that users frequently require. Regarding
the control flow, we are currently also considering the design of iteration for
flink-ml, and as Xintong has pointed out, it also requires control flow in cases
like detecting global termination inside the iteration (in this case we need to
broadcast an event through the iteration body to detect whether any records still
reside in the iteration body). And regarding whether to implement the dynamic
configuration framework, I also agree with Xintong that the consistency guarantee
would be a point to consider; we might need to decide whether we must ensure that
every operator receives the dynamic configuration.
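The termination check described above can be illustrated with a synchronous simplification. In each round, a marker event follows all in-flight records through the iteration body (assuming FIFO channels) and counts the records processed. A round whose marker counts zero means no records remain. `run_iteration` and `step` are hypothetical names. A real asynchronous, parallel iteration would additionally need to align markers across channels.

```python
from typing import Callable, Iterable, List


def run_iteration(initial: Iterable[int],
                  step: Callable[[int], List[int]],
                  max_rounds: int = 100) -> int:
    """Return the round in which global termination is detected."""
    in_flight = list(initial)
    for round_no in range(1, max_rounds + 1):
        processed = 0
        feedback: List[int] = []
        for record in in_flight:
            processed += 1                 # counted by the marker trailing this round
            feedback.extend(step(record))  # records fed back into the next round
        # The marker has now passed through the body behind every record of
        # this round. A zero count means the body is empty: terminate.
        if processed == 0:
            return round_no
        in_flight = feedback
    raise RuntimeError("no termination detected within max_rounds")
```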

Best,
Yun



--
From: kai wang
Date: 2021/06/08 11:52:12
To: JING ZHANG
Cc: 刘建刚; Xintong Song [via Apache Flink User Mailing List archive.]; user; dev
Subject: Re: Add control mode for flink



I'm a big +1 for this feature.

1. Limit the input QPS.
2. Change the log level for debugging.

In my team, both of the examples above are needed.
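Limiting input QPS is a natural first consumer of such a control signal. Below is a minimal sketch. It assumes the source checks a token bucket before emitting each record, and that a control signal calls `set_qps` (a hypothetical hook, not a Flink API). The clock is injectable so the behavior can be tested deterministically.

```python
import time
from typing import Callable


class DynamicRateLimiter:
    """Token bucket whose rate can be changed while the job runs (hypothetical sketch)."""

    def __init__(self, qps: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._qps = qps
        self._tokens = qps       # bucket capacity = one second of traffic
        self._last = clock()

    def set_qps(self, qps: float) -> None:
        # Called when a control signal arrives, e.g. {"source.qps": 200}.
        self._refill()           # settle the time elapsed at the old rate first
        self._qps = qps
        self._tokens = min(self._tokens, qps)

    def try_acquire(self) -> bool:
        # The source calls this before emitting a record; False means "wait".
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False

    def _refill(self) -> None:
        now = self._clock()
        self._tokens = min(self._qps, self._tokens + (now - self._last) * self._qps)
        self._last = now
```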
JING ZHANG wrote on Tue, Jun 8, 2021 at 11:18 AM:

Thanks Jiangang for bringing this up.
As mentioned in Jiangang's email, the `dynamic configuration framework` provides
many useful functions at Kuaishou, because it can update job behavior without
relaunching the job. These functions are very popular at Kuaishou, and we also
see similar demands on the mailing list [1].

I'm a big +1 for this feature.

Thanks Xintong and Yun for the deep thoughts on this issue. I like the idea of
introducing a control mode in Flink.
It brings the original issue a big step closer to its essence, and it also opens
up the possibility of more exciting features, as mentioned in Xintong's and
Jark's responses.
Based on this idea, there are at least two milestones toward the goals
proposed by Jiangang:
(1) Build a common control flow framework in Flink.
    It focuses on control flow propagation, and on how to integrate the common
    control flow framework with existing mechanisms.
(2) Build a dynamic configuration framework that is exposed to users
directly.
    The dynamic configuration framework can be seen as an application on top
    of the underlying control flow framework.
    It focuses on the public API that receives configuration update requests
    from users. Besides, it is necessary to introduce an API protection
    mechanism to avoid job performance degradation caused by too many control
    events.
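For the API protection mechanism, one possible approach is to coalesce updates per key and release at most one batch per interval. A burst of requests then becomes a bounded number of control events. This is an assumption of this sketch, not part of the proposal. `CoalescingControlGate` and `min_interval` are hypothetical names.

```python
from typing import Any, Dict, Optional


class CoalescingControlGate:
    """Hypothetical protection layer between the public API and the control flow."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._pending: Dict[str, Any] = {}
        self._last_flush: Optional[float] = None

    def submit(self, key: str, value: Any) -> None:
        # Later updates to the same key overwrite earlier ones (last write wins).
        self._pending[key] = value

    def maybe_flush(self, now: float) -> Optional[Dict[str, Any]]:
        # Return one merged batch to propagate as a control event, or None if
        # there is nothing pending or the previous batch was too recent.
        if not self._pending:
            return None
        if self._last_flush is not None and now - self._last_flush < self.min_interval:
            return None
        batch, self._pending = self._pending, {}
        self._last_flush = now
        return batch
```

Last-write-wins merging suits idempotent settings such as a QPS limit. It would be wrong for one-shot commands such as an offset reset, which would need their own handling.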

I suggest splitting the whole design into two after we reach a consensus on
whether to introduce this feature, because both sub-topics need careful
design.


[1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Dynamic-configuration-of-Flink-checkpoint-interval-td44059.html

Best regards,
JING ZHANG

刘建刚 wrote on Tue, Jun 8, 2021 at 10:01 AM:

Thanks Xintong Song for the detailed supplement. Since a Flink job is
long-running, it is similar to many services, so interacting with it or
controlling it is a common desire. This was our initial thought when
implementing the feature. In our internal version of Flink, many configs set in
the YAML file can be adjusted dynamically to avoid restarting the job, for
example:

1. Limit the input QPS.
2. Degrade the job by sampling and so on.
3. Reset Kafka offsets in certain cases.
4. Stop checkpointing in certain cases.
5. Control the consumption of historical data.
6. Change the log level for debugging.
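Most of the items above come down to changing a value that an operator reads at runtime. Below is a minimal sketch of an operator-side registry. It assumes each adjustable option registers a handler and a validator. The keys `source.qps` and `checkpoint.paused` are hypothetical, not real Flink options. The registry validates the whole update before applying any part of it, so a partially invalid signal cannot leave the operator half-configured.

```python
from typing import Any, Callable, Dict


class DynamicConfigRegistry:
    """Hypothetical registry dispatching control-signal updates by key."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Any], None]] = {}
        self._validators: Dict[str, Callable[[Any], bool]] = {}

    def register(self, key: str, handler: Callable[[Any], None],
                 validator: Callable[[Any], bool] = lambda v: True) -> None:
        self._handlers[key] = handler
        self._validators[key] = validator

    def apply(self, update: Dict[str, Any]) -> None:
        # Validate everything first so the update is applied all-or-nothing.
        for key, value in update.items():
            if key not in self._handlers:
                raise KeyError(f"not dynamically adjustable: {key}")
            if not self._validators[key](value):
                raise ValueError(f"invalid value for {key}: {value!r}")
        for key, value in update.items():
            self._handlers[key](value)
```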

After deep discussion, we realized that a common control flow will benefit both
users and developers. Dynamic config is just one of the use cases. The concrete
design and implementation involve many components, such as the JobMaster,
network channels, and operators, which need deeper consideration and design.
Xintong Song [via Apache Flink User Mailing List archive.]
wrote on Mon, Jun 7, 2021 at 2:52 PM:

Thanks Jiangang for bringing this up, and Steven & Peter for the feedback.

I was part of the preliminary offline discussions before this proposal went 
public. So maybe I can help clarify things a bit.

In short, although the phrase "control mode" might be a bit misleading, what we
truly want to do, from my side, is to make the concept of "control flow" explicit
and expose it to users.

## Background
Jiangang & his colleagues at Kuaishou maintain an internal version of Flink.
One of their custom features allows operator behavior to be changed dynamically
via REST APIs. He's willing to contribute this feature to the community, and
came to Yun Gao and me for suggestions. After discussion, we feel that the
underlying question to be answered is how we should model the control flow in
Flink. Dynamically controlling jobs via the REST API can be one of the features
built on top of the control flow, and there could be others.

##