Hi Fuyao,

About your second question: You are right that taking and restoring from
savepoints will incur a performance loss. They cannot be incremental, and
cannot use native (low-level) data formats - for now. These issues are on
the list of things to improve for Flink 1.15, so if the changes make it
into the release, it may improve a lot.
You can restore a job from a retained checkpoint (provided you configured
retained checkpoints, else they are deleted on job cancellation), see [1]
(right below the part you linked). It should be possible to rescale using a
retained checkpoint, despite the docs suggesting otherwise (it was
uncertain whether this guarantee should/can be given, so it was not stated
in the docs. This is also expected to change in the future as it is a
necessity for further reactive mode development).

Best,
Nico

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/checkpoints/#resuming-from-a-retained-checkpoint

On Wed, Oct 27, 2021 at 11:56 PM Fuyao Li <fuyao...@oracle.com> wrote:

> Hello Community,
>
>
>
> I am checking the reactive mode for Flink deployment. I noticed that this
> is supported in Kubernetes environment, but only for standalone Kubernetes
> as of now. I have read some previous discussion threads regarding this
> issue. See [1][2][3][4][5][6].
>
>
>
> Question 1:
>
> It seems that due to some interface and design considerations [4]
> mentioned by Robert and Xintong and official doc[5], this feature is only
> for standalone k8s and it is not available for native Kubernetes now.
> However, I believe in theory, it is possible to be added to native
> Kubernetes, right? Will this be part of the future plan? If not, what is
> the restriction and is it a hard restriction?
>
>
>
> Question 2:
>
> I have built an native Kubernetes operator on top of Yang’s work [7]
> supporting various state transfers in native k8s application mode and
> session mode. Right now, I am seeking for adding some similar features like
> reactive scaling for native k8s. From my perspective, what I can do is to
> enable periodic savepoints and scale up/down based certain metrics we
> collect inside the Flink application. Some additional resource
> considerations need to be added to implement such feature, similar to the
> adaptive scheduler concept in [9][10] (I didn’t dive deep into that, I
> guess I just need to calculated the new TMs will be offered with sufficient
> k8s resources if the rescale happens?)
>
> I think as a user/operator, I am not supposed by to be able to
> recover/restarted a job from checkpoint [8].
>
> I guess this might cause some performance loss since savepoints are more
> expensive and the Flink application must do both savepoint and checkpoint
> periodically… Is there any possible ways that user can also use checkpoints
> to restart and recover as a user? If Question 1 will be part of the future
> plan, I guess I won’t need much work here.
>
>
>
> Reference:
>
> [1] Reactive mode blog:
> https://flink.apache.org/2021/05/06/reactive-mode.html
>
> [2] example usage of reactive scaling:
> https://github.com/rmetzger/flink-reactive-mode-k8s-demo
>
> [3] FILP:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-159%3A+Reactive+Mode
>
> [4] Discussion thread:
> https://lists.apache.org/thread.html/ra688faf9dca036500f0445c55671e70ba96c70f942afe650e9db8374%40%3Cdev.flink.apache.org%3E
>
> [5] Flink doc:
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/elastic_scaling/
>
> [6] Flink Jira: https://issues.apache.org/jira/browse/FLINK-10407\
> <https://issues.apache.org/jira/browse/FLINK-10407/>
>
> [7] https://github.com/wangyang0918/flink-native-k8s-operator
>
> [8]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/checkpoints/#difference-to-savepoints
>
> [9]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-138%3A+Declarative+Resource+management
>
> [10]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
>
>
>
> Thanks,
>
> Fuyao
>

Reply via email to