Re: [DISCUSS] Is it a bug that the AdaptiveScheduler does not prioritize releasing TaskManagers during downscaling in Application mode?

2024-11-05 Thread Gyula Fóra
Hey All! The main purpose of the adaptive scheduler is to be able to adapt to changing resource availability and requirements. Originally it was designed to work based on resource availability (with reactive scaling) so when we have more resources we scale up, and when we have less we scale down, at that p

Re: [Flink Kubernetes Operator] How to enable Flink history server for the Flink jobs managed by Flink Kubernetes operator

2024-08-21 Thread Gyula Fóra
ast-state" vs "savepoint" in production? > > >I am actually working on adding a new way to perform the last-state > upgrade via simple cancellation but that's a slightly orthogonal question. > Will this new way help generate a new job.id during last-state upgrad

Re: [Flink Kubernetes Operator] How to enable Flink history server for the Flink jobs managed by Flink Kubernetes operator

2024-08-20 Thread Gyula Fóra
Hi Alan! The job.id remains the same as the last-state mode uses Flink's internal failover mechanism to access the state. We cannot change the job.id while doing this unfortunately. Savepoint upgrades on the other hand would generate a new job id (at least after a recent fix on operator main). I a

Re: Flink reactive deployment on with kubernetes operator

2024-07-11 Thread Gyula Fóra
Hi Eric! The community cannot support old versions of the Flink operator, please upgrade to the latest version (1.9.0). Also, we do not recommend using the Reactive mode (with standalone). You should instead try Native Mode + Autoscaler which works much better in most cases. Cheers, Gyula

[ANNOUNCE] Apache Flink Kubernetes Operator 1.9.0 released

2024-07-03 Thread Gyula Fóra
The Apache Flink community is very happy to announce the release of Apache Flink Kubernetes Operator 1.9.0. The Flink Kubernetes Operator allows users to manage their Apache Flink applications and their lifecycle through native k8s tooling like kubectl. Release blogpost: https://flink.apache.org/

Re: Understanding flink-autoscaler behavior

2024-06-07 Thread Gyula Fóra
Hi! To simplify things you can generally look at TRUE_PROCESSING_RATE, SCALE_UP_RATE_THRESHOLD and SCALE_DOWN_RATE_THRESHOLD. If TPR is below the scale up threshold then we should scale up and if it's above the scale down threshold then we scale down. In your case what we see for your source (cbc
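
The rule of thumb in this message can be written down as a small decision function. This is only a sketch of the simplified rule as stated (the message itself says "to simplify things"), not the operator's actual implementation; the function name `decide`, the `Decision` enum, and the assumption that the scale-up threshold is not larger than the scale-down threshold are illustrative.

```python
from enum import Enum


class Decision(Enum):
    SCALE_UP = "scale_up"
    SCALE_DOWN = "scale_down"
    NO_CHANGE = "no_change"


def decide(true_processing_rate: float,
           scale_up_rate_threshold: float,
           scale_down_rate_threshold: float) -> Decision:
    """Apply the simplified rule from the message to one vertex.

    Assumption (not from the source): scale_up_rate_threshold is less
    than or equal to scale_down_rate_threshold, so the two conditions
    cannot both hold.
    """
    # TPR below the scale-up threshold: the vertex cannot keep up.
    if true_processing_rate < scale_up_rate_threshold:
        return Decision.SCALE_UP
    # TPR above the scale-down threshold: the vertex has spare capacity.
    if true_processing_rate > scale_down_rate_threshold:
        return Decision.SCALE_DOWN
    # Anything in between is left alone.
    return Decision.NO_CHANGE
```

With a scale-up threshold of 100 and a scale-down threshold of 300 records per second, a TPR of 50 asks for a scale-up, 400 for a scale-down, and 200 for no change.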

Re: Flink Kubernetes Operator Pod Disruption Budget

2024-06-03 Thread Gyula Fóra
Hey Jeremy! This sounds like a good / fairly simple extension to add. Since this would result in a larger extension of the current FlinkDeployment CRD, it would be good to cover it in a small FLIP. Cheers, Gyula On Wed, May 22, 2024 at 10:20 PM Jeremy Alvis via user < user@flink.apache.org> wrot

Re: Flink Kubernetes Operator 1.8.0 CRDs

2024-05-09 Thread Gyula Fóra
Hey! We have not observed any issue so far, can you please share some error information / log ? Opening a jira ticket would be best Thanks Gyula On Thu, 9 May 2024 at 21:18, Prasad, Neil wrote: > I am writing to report an issue with the Flink Kubernetes Operator version > 1.8.0. The CRD is una

Re: Flink scheduler keeps trying to schedule the pods indefinitely

2024-05-05 Thread Gyula Fóra
Hey! Let me first answer your questions then provide some actual solution hopefully :) 1. The adaptive scheduler would not reduce the vertex desired parallelism in this case but it should allow the job to start depending on the lower/upper bound resource config. There have been some changes in ho

Re: Autoscaling with flink-k8s-operator 1.8.0

2024-05-01 Thread Gyula Fóra
Hi Chetas, The operator logic itself would normally call the rescale api during the upgrade process, not the autoscaler module. The autoscaler module sets the correct config with the parallelism overrides, and then the operator performs the regular upgrade cycle (as when you yourself change someth

Re: [Flink Kubernetes Operator] The "last-state" upgrade mode is only supported in FlinkDeployments

2024-04-30 Thread Gyula Fóra
easier to implement > "last-state" upgrade mode. When you were saying "robust way", does it mean > "sticky job id" in application mode? > > > On Mon, Apr 29, 2024 at 10:28 PM Gyula Fóra wrote: > >> Hi Alan! >> >> I think it should be poss

Re: [Flink Kubernetes Operator] The "last-state" upgrade mode is only supported in FlinkDeployments

2024-04-29 Thread Gyula Fóra
Hi Alan! I think it should be possible to address this gap for most cases. We don't have the same robust way of getting the last-state information for session jobs as we do for applications, so it will be slightly less reliable overall. For session jobs the last checkpoint info has to be queried f

Re: [External] Exception during autoscaling operation - Flink 1.18/Operator 1.8.0

2024-04-26 Thread Gyula Fóra
ing job (autoscaler scales the job to > zero for “gracefully” stopping it and then never starts it) or > b) some jobs that keep restarting can be fixed by disabling HA for that job > > And ` *Cannot rescale the given pointwise partitioner.` *is also still a > mystery. > > *Thank

Re: [External] Exception during autoscaling operation - Flink 1.18/Operator 1.8.0

2024-04-26 Thread Gyula Fóra
Hi Maxim! Regarding the status update error, it could be related to a problem that we have discovered recently with the Flink Operator HA. Where during a namespace change both leader and follower instances would start processing. It has been fixed in the current master by updating the JOSDK versio

Re: Unaligned checkpoint blocked by long Async operation

2024-03-15 Thread Gyula Fóra
ch also involves some code modifications on > the mailbox executor. > > > Best, > Zakelly > > On Thu, Mar 14, 2024 at 9:15 PM Gyula Fóra wrote: > >> Thank you for the detailed analysis Zakelly. >> >> I think we should consider whether yield should process

Re: Unaligned checkpoint blocked by long Async operation

2024-03-14 Thread Gyula Fóra
24, > TimeUnit.HOURS, > 1) > .print(); > ``` > The checkpoint 1 can be normally finished after the "Complete one" log > print. > > I guess the users have no means to solve this problem, we might optimize > this later. >

Unaligned checkpoint blocked by long Async operation

2024-03-14 Thread Gyula Fóra
Hey all! I encountered a strange and unexpected behaviour when trying to use unaligned checkpoints with AsyncIO. If the async operation queue is full and backpressures the pipeline completely, then unaligned checkpoints cannot be completed. To me this sounds counterintuitive because one of the be

Re: Temporal join on rolling aggregate

2024-03-05 Thread Gyula Fóra
Hi Everyone! I have discussed this with Sébastien Chevalley, he is going to prepare and drive the FLIP while I will assist him along the way. Thanks Gyula On Tue, Mar 5, 2024 at 9:57 AM wrote: > I do agree with Ron Liu. > This would definitely need a FLIP as it would impact SQL and extend it >

Re: flink-operator-1.5.0 supports which versions of Kubernetes

2024-03-04 Thread Gyula Fóra
It should be compatible. There is no compatibility matrix but it is compatible with most versions that are in use (at the different companies/users etc) Gyula On Thu, Feb 29, 2024 at 6:21 AM 吴圣运 wrote: > Hi, > > I'm using flink-operator-1.5.0 and I need to deploy it to Kubernetes 1.20. > I want

Re: Temporal join on rolling aggregate

2024-02-22 Thread Gyula Fóra
Posting this to dev as well as it potentially has some implications on development effort. What seems to be the problem here is that we cannot control/override Timestamps/Watermarks/Primary key on VIEWs. It's understandable that you cannot create a PRIMARY KEY on the view but I think the temporal

Re: Flink Kubernetes Operator - Deadlock when Cluster Cleanup Fails

2024-02-13 Thread Gyula Fóra
Hi Niklas! The best way to report the issue would be to open a JIRA ticket with the same detailed information. Otherwise I think your observations are correct and this is indeed a frequent problem that comes up, it would be good to improve on it. In addition to improving logging we could also inc

Re: Flink pending record metric weired after autoscaler rescaling

2024-01-12 Thread Gyula Fóra
Could this be related to the issue reported here? https://issues.apache.org/jira/browse/FLINK-34063 Gyula On Wed, Jan 10, 2024 at 4:04 PM Yang LI wrote: > Just to give more context, my setup uses Apache Flink 1.18 with the > adaptive scheduler enabled, issues happen randomly particularly > post

Re: Re: Optional fields during SQL insert

2024-01-11 Thread Gyula Fóra
> > > -- > Best! > Xuyang > > > At 2024-01-11 16:10:47, "Giannis Polyzos" wrote: > > Hi Gyula, > to the best of my knowledge, this is not feasible and you will have to do > something like *CAST(NULL AS STRING)* to insert null values manually. > > Best, >

Optional fields during SQL insert

2024-01-10 Thread Gyula Fóra
Hi All! Is it possible to insert into a table without specifying all columns of the target table? In other words can we use the default / NULL values of the table when not specified somehow? For example: Query schema: [a: STRING] Sink schema: [a: STRING, b: STRING] I would like to be able to si

Re: [Flink Kubernetes Operator] Restoring from an outdated savepoint.

2023-12-22 Thread Gyula Fóra
Please upgrade the operator to the latest release, and if the issue still exists please open a Jira ticket with the details. Gyula On Fri, 22 Dec 2023 at 21:17, Ruibin Xing wrote: > I wanted to talk about an issue we've hit recently with Flink Kubernetes > Operator 1.6.1 and Flink 1.17.1. > > A

Re: Flink operator autoscaler scaling down

2023-12-11 Thread Gyula Fóra
>>>> > Thank you for the feedback! With your permission, I plan to integrate >>> the implementation into the flink-kubernetes-operator-autoscaler module to >>> test it on my env. Subsequently, maybe contribute these changes back to the >>> community by submitting a

Re: Production deployment of Flink

2023-12-07 Thread Gyula Fóra
Hi! We recommend using the community supported Flink Kubernetes Operator: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.7/docs/try-flink-kubernetes-operator/quick-start/ Cheers, Gyula On Thu, Dec 7, 2023 at 6:33 PM Tauseef Janvekar wrote: > Hi Al, > > I am using

Re: Flink Kubernetes Operator: Why one Helm repo for each version?

2023-12-01 Thread Gyula Fóra
Hi! I already answered your question on slack : “The main reason is that this allows us to completely separate release resources etc. much easier for the release process But this could be improved in the future if there is a good proposal for the process” Please do not cross post questions bet

[ANNOUNCE] Apache Flink Kubernetes Operator 1.7.0 released

2023-11-22 Thread Gyula Fóra
The Apache Flink community is very happy to announce the release of Apache Flink Kubernetes Operator 1.7.0. The Flink Kubernetes Operator allows users to manage their Apache Flink applications and their lifecycle through native k8s tooling like kubectl. Release highlights: - Standalone autoscale

Re: Flink operator autoscaler scaling down

2023-11-07 Thread Gyula Fóra
s or by >applying a kubectl patch to the FlinkDeployment CRD. > > By doing this we could achieve something similar to what we can do with a > plugin system, Of course in this case I'll disable scaling of the flink > operator, Do you think it could work? > > Best,

Re: Flink operator autoscaler scaling down

2023-11-06 Thread Gyula Fóra
Hey! Bit of a tricky problem, as it's not really possible to know that the job will be able to start with lower parallelism in some cases. Custom plugins may work but that would be an extremely complex solution at this point. The Kubernetes operator has a built-in rollback mechanism that can help

Re: Issue with flink-kubernetes-operator not updating execution.savepoint.path after savepoint deletion

2023-10-21 Thread Gyula Fóra
/savepoint-ee4f7c-550378a4b4d1 (allowing non restored state) >> ... >> 2023-10-21 10:25:47,703 INFO >> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job >> ee4f7c678794ee16506f9b41425c244e reached terminal state FAILED. >> org.apache.flink.runtime

Re: Flink kubernets operator delete HA metadata after resuming from suspend

2023-10-19 Thread Gyula Fóra
om/apache/flink-kubernetes-operator/pull/673 > This solved the problem > > > -- > *From:* Tony Chen > *Sent:* 19 October 2023, 4:18:36 > *To:* Evgeniy Lyutikov > *Cc:* user@flink.apache.org; Gyula Fóra > *Subject:* Re: Flink kuberne

Re: Flink Operator 1.6 causes JobManagerDeploymentStatus: MISSING

2023-10-18 Thread Gyula Fóra
Hi! Not sure if it’s the same but could you try picking up the fix from the release branch and confirming that it solves the problem? If it does we may consider a quick bug fix release. Cheers Gyula On Wed, 18 Oct 2023 at 18:09, Tony Chen wrote: > Hi Flink Community, > > Most of the Flink appl

Re: [EXTERNAL]Re: Next Release of the Flink Kubernetes Operator

2023-10-17 Thread Gyula Fóra
very much for the update about the release schedule and for > pointing me to the snapshot images. This is indeed very helpful and we will > consider our options now. > > Regards, > Niklas > > On 16. Oct 2023, at 17:56, Gyula Fóra wrote: > > Hi Niklas! > > We weren'

Re: Next Release of the Flink Kubernetes Operator

2023-10-16 Thread Gyula Fóra
Hi Niklas! We weren't planning a 1.6.1 release and instead we were focusing on wrapping up changes for the 1.7.0 release coming in a month or so. However if there is enough interest and we have some committers/PMC willing to help with the release we can always do 1.6.1 but I personally don't have

Re: Using Flink k8s operator on OKD

2023-10-05 Thread Gyula Fóra
Hey, We don’t have a minimum supported version in the docs as we haven’t experienced any issue specific to Kubernetes versions so far. We don’t really rely on any newer features. Cheers Gyula On Fri, 6 Oct 2023 at 06:02, Krzysztof Chmielewski < krzysiek.chmielew...@gmail.com> wrote: > It seems t

Re: Rolling back a bad deployment of FlinkDeployment on kubernetes

2023-10-05 Thread Gyula Fóra
Hi Tony! There are still a few corner cases when the operator cannot upgrade / rollback deployments due to the loss of HA metadata (and with that checkpoint information). Most of these issues are not related to the operator logic directly but to how Flink handles certain failures and are related

Re: Issue with flink-kubernetes-operator not updating execution.savepoint.path after savepoint deletion

2023-09-22 Thread Gyula Fóra
Hi Operator savepoint retention and savepoint upgrades have nothing to do with each other I think. Retention is only for periodic savepoints triggered by the operator itself. I would upgrade to the latest 1.6.0 operator version before investigating further. Cheers Gyula On Sat, 23 Sep 2023 at

Re: Zookeeper HA with Kubernetes: Possible to use the same Zookeeper cluster w/multiple Flink Operators?

2023-09-20 Thread Gyula Fóra
Hi! The cluster-id for each FlinkDeployment is simply the name of the deployment. So they are all different in a given namespace. (In other words they are not fixed as your question suggests but set automatically) So there should be no problem sharing the ZK cluster . Cheers Gyula On Thu, 21 Se

Re: Checkpoint jitter?

2023-09-13 Thread Gyula Fóra
No, I think what he means is to trigger the checkpoint at slightly different times at the different sources so the different parts of the pipeline would not checkpoint at the same time. Gyula On Wed, Sep 13, 2023 at 10:32 AM Hangxiang Yu wrote: > Hi, Matyas. > Do you mean something like adjusti

Re: Flink kubernets operator delete HA metadata after resuming from suspend

2023-09-12 Thread Gyula Fóra
n Mon, Sep 11, 2023 at 7:47 PM Gyula Fóra wrote: > You don’t need it but you can really mess up clusters by rolling back CRD > changes… > > On Mon, 11 Sep 2023 at 19:42, Evgeniy Lyutikov > wrote: > >> Why we need to use latest CRD versio

Re: Flink kubernets operator delete HA metadata after resuming from suspend

2023-09-11 Thread Gyula Fóra
You don’t need it but you can really mess up clusters by rolling back CRD changes… On Mon, 11 Sep 2023 at 19:42, Evgeniy Lyutikov wrote: > Why we need to use latest CRD version with older operator version? > -- > *From:* Gyula Fóra > *Sent:* 12 Sep

Re: Flink kubernets operator delete HA metadata after resuming from suspend

2023-09-11 Thread Gyula Fóra
11 September 2023, 23:50:26 > *To:* Gyula Fóra > > *Cc:* user@flink.apache.org > *Subject:* Re: Flink kubernets operator delete HA metadata after resuming > from suspend > > > Hi! > No, no one could restart jobmanager, > I monitored the pods in real time, the

Re: Flink kubernets operator delete HA metadata after resuming from suspend

2023-09-11 Thread Gyula Fóra
Hi! I could not reproduce your issue, last-state suspend/restore seems to work as before. However these 2 logs seem very suspicious: 2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING 2023-09-11 06:02:07,488

Re: observedGeneration field in FlinkDeployment

2023-09-08 Thread Gyula Fóra
" flink.apache.org/v1beta1 ","metadata":{"generation":2},"firstDeployment":true}}' It's a bit hidden but it should do the trick :) We could discuss moving this to a more standardized status field if you think that's worth the effort. Gyula On Sat, Sep

Re: observedGeneration field in FlinkDeployment

2023-09-08 Thread Gyula Fóra
Hi! The lastReconciledSpec field serves a similar purpose. We also use the generation in parts of the logic but not generically as observed generation. Could you give an example where this would be useful in addition to what we already have? Thanks Gyula On Sat, 9 Sep 2023 at 02:17, Tony Chen w

Re: [Question] How to scale application based on 'reactive' mode

2023-09-07 Thread Gyula Fóra
>Also, the metrics are on a per-task granularity and allow us to identify >>bottleneck tasks. >>3. Autoscaler feature currently only works for K8s operator + native >>K8s mode. >> >> >> Best, >> Zhanghao Chen >> -

Re: [Question] How to scale application based on 'reactive' mode

2023-09-01 Thread Gyula Fóra
ill be restarted when scaling. But job > parallelism is the same after the number of TM has been changed. > > *Autoscaler + 'reactive' mode*: > It can control numbers of TM by metric, and increase/decrease job > parallelism by changing TM. > > Regards, > Jung >

Re: [Question] How to scale application based on 'reactive' mode

2023-09-01 Thread Gyula Fóra
are above or below a certain threshold, additional >> TaskManagers can be added or removed from the Flink cluster.* >> => Why is this only possible in 'reactive' mode? Seems this is more >> related to 'autoscaler'. Are there some specific features/API wh

Re: [Question] How to scale application based on 'reactive' mode

2023-08-31 Thread Gyula Fóra
> Regards, > Jung > > > > On Fri, 18 Aug 2023 at 7:51 PM, Gyula Fóra wrote: > >> Hi! >> >> I think what you need is probably not the reactive mode but a proper >> autoscaler. The reactive mode as you say doesn't do anything in itself, you >> need to

Re: flink k8s operator - problem with patching seession cluster

2023-08-31 Thread Gyula Fóra
"Flink Deployments: " + item); > System.out.println("Number of TM replicas: " + > item.getSpec().getTaskManager().getReplicas()); > } > } > > > Thanks, > Krzysztof > > On Thu, 31 Aug 2023 at 10:44, Gyula Fóra wrote: >

Re: flink k8s operator - problem with patching seession cluster

2023-08-31 Thread Gyula Fóra
I guess your question is in the context of the standalone integration because native session deployments automatically add TMs on the fly as more are necessary. For standalone mode you should be able to configure `spec.taskManager.replicas` and if I understand correctly that will not shut down the

Re: Blue green deployment with Flink Apache Operator

2023-08-31 Thread Gyula Fóra
ill have something to share with you. > > Nicolas > > On Wed, Aug 30, 2023 at 4:28 PM Gyula Fóra wrote: > >> Hey! >> >> I don't know if anyone has implemented this or not but one way to >> approach this problem (and this may not be the right way, just an id

Re: Enable RocksDB in FlinkDeployment with flink-kubernetes-operator

2023-08-30 Thread Gyula Fóra
> >> I will also need to go through the documentation more on memory >> configuration: >> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/state_backends/ >> >> On Wed, Aug 30, 2023 at 2:17 PM Gyula Fóra wrote: >> >>> Hi! >>>

Re: Enable RocksDB in FlinkDeployment with flink-kubernetes-operator

2023-08-30 Thread Gyula Fóra
Hi! RocksDB is supported and every other state backend as well. You can simply set this in your config like before :) Cheers Gyula On Wed, 30 Aug 2023 at 19:22, Tony Chen wrote: > Hi Flink Community, > > Does the flink-kubernetes-operator support RocksDB as the state backend > for FlinkDeploym

Re: Blue green deployment with Flink Apache Operator

2023-08-30 Thread Gyula Fóra
Hey! I don't know if anyone has implemented this or not but one way to approach this problem (and this may not be the right way, just an idea :) ) is to add a new Custom Resource type that sits on top of the FlinkDeployment / FlinkSessionJob resources and add a small controller for this. This new

Re: [Question] How to scale application based on 'reactive' mode

2023-08-18 Thread Gyula Fóra
Hi! I think what you need is probably not the reactive mode but a proper autoscaler. The reactive mode as you say doesn't do anything in itself, you need to build a lot of logic around it. Check this instead: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resou

[ANNOUNCE] Apache Flink Kubernetes Operator 1.6.0 released

2023-08-15 Thread Gyula Fóra
The Apache Flink community is very happy to announce the release of Apache Flink Kubernetes Operator 1.6.0. The Flink Kubernetes Operator allows users to manage their Apache Flink applications and their lifecycle through native k8s tooling like kubectl. Release highlights: - Improved rollback me

Re: Flink Kubernetes Operator autoscaling GPU-based workload

2023-08-01 Thread Gyula Fóra
The autoscaler only works for FlinkDeployments in Native mode. You should turn off the reactive scheduler mode as well because that's something completely different. After that you can check the autoscaler logs for more info. Gyula On Tue, Aug 1, 2023 at 10:33 AM Raihan Sunny via user wrote: >

Re: [EXTERNAL] Re: Query on flink-operator autoscale support

2023-08-01 Thread Gyula Fóra
s cluster, if one pod is added to the session cluster, > the job running on will be rebalanced to the new one, is it correct? > > Thank you very much. > Xiao Ma > > On Wed, Feb 1, 2023 at 10:56 AM Gyula Fóra wrote: > >> As I mentioned in the previous email, standalone mod

Re: Questions on Restarting a Flink Application from a savepoint or checkpoint

2023-07-21 Thread Gyula Fóra
application? I was wondering if there's a field in > the kubernetes field where I can specify which checkpoint to start from. > For some of our applications, we complete checkpoints more often > than savepoints, and we would like these Flink applications to always start > from the

Re: How to define imagePullSecrets with k8s operator to fetch image using FlinkDeploymentSpec

2023-07-21 Thread Gyula Fóra
Hi! We don't have imagePullSecrets as part of the FlinkDeploymentSpec at the moment, however you can simply use the following built-in Flink configuration: https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/config/#kubernetes-container-image-pull-secrets kubernetes.containe

Re: Questions on Restarting a Flink Application from a savepoint or checkpoint

2023-07-19 Thread Gyula Fóra
ommunity be okay with us adding this feature to the GitHub > repo eventually? I was going through this guide > <https://flink.apache.org/how-to-contribute/contribute-code/>, and it > looks like I need to get consensus first. > > Thanks, > Tony > > On Wed, Jul 19, 202

Re: Questions on Restarting a Flink Application from a savepoint or checkpoint

2023-07-19 Thread Gyula Fóra
deployment gets deleted. > > Thanks, > Tony > > On Wed, Jul 19, 2023 at 3:46 PM Gyula Fóra wrote: > >> Hey Tony, >> >> Please see: >> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful

Re: Questions on Restarting a Flink Application from a savepoint or checkpoint

2023-07-19 Thread Gyula Fóra
Hey Tony, Please see: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades The operator is made especially to handle stateful application upgrades robustly. In general any spec change that you make

Re: [Flink K8s Operator] Trigger nonce missing for manual savepoint info

2023-07-12 Thread Gyula Fóra
Maybe you have inconsistent operator / CRD versions? In any case I highly recommend upgrading to the latest operator version to get all the bug / security fixes and improvements. Gyula On Wed, 12 Jul 2023 at 10:58, Paul Lam wrote: > Hi, > > I’m using K8s operator 1.3.1 with Flink 1.15.2 on 2 K8s

Re: why are kubernetes.namespace and kubernetes.cluster-id config fields forbidden in the flink-kubernetes-operator validator?

2023-06-14 Thread Gyula Fóra
The namespace and cluster id are automatically set based on the namespace and name of the FlinkDeployment resource. This is an important design choice that allows efficient management of the applications. Gyula On Wed, 14 Jun 2023 at 19:31, Nathan Moderwell < nathan.moderw...@robinhood.com> wro
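
As an illustration of the behaviour described here, the sketch below derives the two settings from a resource's metadata and rejects user-supplied values for them. It is a hedged model, not operator code: the function name `effective_flink_config` is hypothetical, the resource is represented as a plain dict shaped like a FlinkDeployment manifest, and the error message is invented. The two config keys are the ones named in the thread subject.

```python
FORBIDDEN_KEYS = {"kubernetes.namespace", "kubernetes.cluster-id"}


def effective_flink_config(resource: dict) -> dict:
    """Return the Flink config with namespace and cluster id filled in.

    `resource` is a dict shaped like a FlinkDeployment manifest
    (metadata.name, metadata.namespace, spec.flinkConfiguration).
    """
    metadata = resource["metadata"]
    user_config = resource.get("spec", {}).get("flinkConfiguration", {})

    # Mirror the validator described in the thread: users may not set these.
    clash = FORBIDDEN_KEYS & user_config.keys()
    if clash:
        raise ValueError(f"forbidden config keys: {sorted(clash)}")

    config = dict(user_config)
    # Both values come from the resource itself, never from user config.
    config["kubernetes.namespace"] = metadata["namespace"]
    config["kubernetes.cluster-id"] = metadata["name"]
    return config
```

Because the cluster id is the resource name, two deployments in one namespace cannot end up with the same id.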

Re: Kubernetes operator: config for taskmanager.memory.process.size ignored

2023-06-14 Thread Gyula Fóra
the process > memory and the pod memory, which helped stability. It looks like it cannot > be done with the k8s operator though and I wonder why the choice of > removing this granularity in the settings > > Robin > > On Wed, 14 Jun 2023 at 12:20, Gyula Fóra wrote: > >

Re: Kubernetes operator: config for taskmanager.memory.process.size ignored

2023-06-14 Thread Gyula Fóra
Basically what happens is that whatever you set to the spec.taskManager.resource.memory will be set in the config as process memory. In Flink kubernetes the process is the pod so pod memory is always equal to process memory. So basically the spec is a config shorthand, there is no reason to overri
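
The "config shorthand" described in this message can be modelled as follows. This is a sketch under assumptions, not the operator's code: the function name `apply_memory_shorthand` is hypothetical, and the choice to let the spec value win over a user-supplied `taskmanager.memory.process.size` is inferred from the thread subject, which reports that the config value is ignored.

```python
def apply_memory_shorthand(spec: dict, flink_config: dict) -> dict:
    """Copy spec.taskManager.resource.memory into the process-size setting.

    Returns a new dict; the input config is left unmodified.
    """
    config = dict(flink_config)
    memory = spec.get("taskManager", {}).get("resource", {}).get("memory")
    if memory is not None:
        # The pod is the process, so pod memory becomes process memory.
        # Any user-supplied value for this key is overwritten (assumption
        # based on the behaviour reported in the thread).
        config["taskmanager.memory.process.size"] = memory
    return config
```

For example, a spec memory of `2048m` combined with a user config of `taskmanager.memory.process.size: 4g` yields `2048m` in this model.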

Re: Fail to run flink 1.17 job with flink-operator 1.5.0 version

2023-06-12 Thread Gyula Fóra
Hi! I think you forgot to upgrade the operator CRD (which contains the updated enum values). Please see: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/upgrade/#1-upgrading-the-crd Cheers Gyula On Mon, 12 Jun 2023 at 13:38, Liting Liu (litiliu) wrote: >

Re: RocksDB segfault on state restore

2023-06-01 Thread Gyula Fóra
also saw core dump while using list state after triggering state > migration and ttl compaction filter. Have you triggered the schema > evolution ? > It seems a bug of the rocksdb list state together with ttl compaction > filter. > > On Wed, May 17, 2023 at 7:05 PM Gyula Fóra wrote

Re: Flink Kubernetes Operator lifecycle state count metrics question

2023-05-23 Thread Gyula Fóra
Hi Andrew! I think you are completely right, this is a bug. The per namespace metrics do not seem to filter per namespace and show the aggregated global count for each namespace. I opened a ticket: https://issues.apache.org/jira/browse/FLINK-32164 Thanks for reporting this! Gyula On Mon, May 22

[ANNOUNCE] Apache Flink Kubernetes Operator 1.5.0 released

2023-05-17 Thread Gyula Fóra
The Apache Flink community is very happy to announce the release of Apache Flink Kubernetes Operator 1.5.0. The Flink Kubernetes Operator allows users to manage their Apache Flink applications and their lifecycle through native k8s tooling like kubectl. Release highlights: - Autoscaler improveme

RocksDB segfault on state restore

2023-05-17 Thread Gyula Fóra
Hi All! We are encountering an error on a larger stateful job (around 1 TB + state) on restore from a RocksDB checkpoint. The taskmanagers keep crashing with a segfault coming from the RocksDB native logic, which seems to be related to the FlinkCompactionFilter mechanism. The gist with the full error

Re: [Flink K8s Operator] Automatic cleanup of terminated deployments

2023-05-14 Thread Gyula Fóra
There is no such feature currently, Kubernetes resources usually do not delete themselves :) The problem I see here is by deleting the resource you lose all information about what happened, you won't know if it failed or completed etc. What is the use-case you are thinking about? If this is someth

Re: flink-kubernetes-operator HA k8s RoleBinding for Leases?

2023-05-08 Thread Gyula Fóra
Hey! Sounds like a bug :) Could you please open a jira / PR (in case you fixed this already)? Thanks Gyula On Mon, 8 May 2023 at 22:20, Andrew Otto wrote: > Hi, > > I'm trying to enable HA for flink-kubernetes-operator >

Re: Autoscaler

2023-05-01 Thread Gyula Fóra
There is only one kind of autoscaler in the Flink Kubernetes Operator. And the docs can be found here: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/ We usually refer to it as the Job Autoscaler (as it scales individual jobs) but the mechani

Re: Can I setup standby taskmanagers while using reactive mode?

2023-04-27 Thread Gyula Fóra
tion for this problem in the future. > > > On Wed, Apr 26, 2023 at 7:20 AM Gyula Fóra wrote: > >> I think the behaviour is going to get a little weird because this would >> actually defeat the purpose of the standby TM. >> MAX - some offset will decrease once you

Re: Flink Kubernetes Operator Scale Issue

2023-04-27 Thread Gyula Fóra
Hi! It’s currently not possible to run the operator in parallel by simply adding more replicas. However there are different things you can do to scale both vertically and horizontally. First of all you can run multiple operators each watching a different set of namespaces to partition the load. Th

Re: Can I setup standby taskmanagers while using reactive mode?

2023-04-26 Thread Gyula Fóra
I think the behaviour is going to get a little weird because this would actually defeat the purpose of the standby TM. MAX - some offset will decrease once you lose a TM so in this case we would scale down to again have a spare (which we never actually use.) Gyula On Wed, Apr 26, 2023 at 4:02 PM

Re: [Flink operator] Flink Autoscale - Limit the max number of scale ups

2023-04-24 Thread Gyula Fóra
Hi! Please open a JIRA ticket with the details of your log, config and operator version and we will take a look! Thanks Gyula On Mon, Apr 24, 2023 at 2:04 PM Sriram Ganesh wrote: > Hi, > > I am trying the autoscale provided by the operator. I found that Autoscale > keeps happening even after re

Re: Kubernetes operator stops responding due to Connection reset by peer

2023-04-21 Thread Gyula Fóra
Hi Alexis, We have recently added support for canary deployments which allows the liveness probe to detect general operator problems. https://issues.apache.org/jira/browse/FLINK-31219 It's not completely automatic and you have to deploy the canaries yourself but I think it will be helpful :) Thi

Re: [Kubernetes Operator] NullPointerException from KubernetesApplicationClusterEntrypoint

2023-03-31 Thread Gyula Fóra
Never seen this before but also you should not set the cluster-id in your config as that should be controlled by the operator itself. Gyula On Fri, Mar 31, 2023 at 2:39 PM Pierre Bedoucha wrote: > Hi, > > > > We are trying to use Flink Kubernetes Operator 1.4.0 with Flink 1.16. > > > > However,

Re: Unable to Use spec.flinkVersion v1_17 with Flink Operator

2023-03-28 Thread Gyula Fóra
I think you forgot to upgrade the CRD during the upgrade process on your cluster. As you can see here: https://github.com/apache/flink-kubernetes-operator/blob/release-1.4/helm/flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml#L38-L44 The newer version already contains suppor

Re: Flink K8s operator pod section of CRD

2023-02-23 Thread Gyula Fóra
Hey! You are right, these fields could have been of the PodTemplate / PodTemplateSpec type (probably PodTemplateSpec is actually better). I think the reason why we used it is twofold: - Simple oversight :) - Flink itself "expects" the podtemplate in this form for the native integration as you c

[ANNOUNCE] Apache Flink Kubernetes Operator 1.4.0 released

2023-02-23 Thread Gyula Fóra
The Apache Flink community is very happy to announce the release of Apache Flink Kubernetes Operator 1.4.0. The Flink Kubernetes Operator allows users to manage their Apache Flink applications and their lifecycle through native k8s tooling like kubectl. Release highlights: - Flink Job Autosca

Re: Kubernetes operator's merging strategy for template arrays

2023-02-23 Thread Gyula Fóra
If you are interested in helping to review this, here is the relevant ticket and the PR I just opened: https://issues.apache.org/jira/browse/FLINK-30786 https://github.com/apache/flink-kubernetes-operator/pull/535 Cheers, Gyula On Thu, Feb 23, 2023 at 2:10 PM Gyula Fóra wrote: > Hi! >

Re: Kubernetes operator's merging strategy for template arrays

2023-02-23 Thread Gyula Fóra
Hi! The current array merging strategy in the operator is basically an overwrite by position yes. I actually have a pending improvement to make this configurable and allow merging arrays by "name" attribute. This is generally more practical for such cases. Cheers, Gyula On Thu, Feb 23, 2023 at 1
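The difference between the two strategies mentioned here can be sketched as follows. This is an illustrative sketch only, not the operator's implementation: the function names, the shallow per-element merge, and the sample container entries are all assumptions made for the example.

```python
from itertools import zip_longest


def merge_by_position(base: list[dict], override: list[dict]) -> list[dict]:
    """Element i of `override` is applied on top of element i of `base`."""
    merged = []
    for b, o in zip_longest(base, override):
        if o is None:
            merged.append(dict(b))       # no override at this index
        elif b is None:
            merged.append(dict(o))       # override is longer than base
        else:
            merged.append({**b, **o})    # override wins key by key
    return merged


def merge_by_name(base: list[dict], override: list[dict]) -> list[dict]:
    """Elements are matched on their "name" attribute instead of their index."""
    merged = {item["name"]: dict(item) for item in base}
    for item in override:
        merged.setdefault(item["name"], {}).update(item)
    return list(merged.values())


# Hypothetical container lists from a base template and an override template.
base = [
    {"name": "flink-main-container", "image": "flink:1.16"},
    {"name": "sidecar", "image": "busybox"},
]
override = [{"name": "sidecar", "image": "busybox:1.36"}]

by_position = merge_by_position(base, override)
by_name = merge_by_name(base, override)
print(by_position)
print(by_name)
```

In this example the positional merge overwrites the first entry with the sidecar definition, while the name-based merge leaves the main container untouched and only updates the sidecar.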

Re: [EXTERNAL] Re: Query on flink-operator autoscale support

2023-02-01 Thread Gyula Fóra
ula. > Is there a roadmap to support standalone session clusters to scale based > on the jobs added/deleted and change in parallelism ? > > Regards, > Swathi C > > ------ > *From:* Gyula Fóra > *Sent:* Wednesday, February 1, 2023 8:54 PM > *To:* S

Re: Query on flink-operator autoscale support

2023-02-01 Thread Gyula Fóra
The autoscaler currently only works with Native App clusters. Native session clusters may be supported in the future, but standalone is not on our roadmap because it uses a different resource/scheduling model. Gyula On Wed, Feb 1, 2023 at 4:22 PM Swathi Chandrashekar wrote: > Hi, > > I was testing

Re: "Error while retrieving the leader gateway" when using Kubernetes HA

2023-01-31 Thread Gyula Fóra
Anton Ippolitov < anton.ippoli...@datadoghq.com> wrote: > I am using the Standalone Mode indeed, should've mentioned it right away. > This fix looks exactly like what I need, thank you!! > > On Tue, Jan 31, 2023 at 9:16 AM Gyula Fóra wrote: > >> There is also a pending

Re: "Error while retrieving the leader gateway" when using Kubernetes HA

2023-01-31 Thread Gyula Fóra
ink/container_utils.go#L215>, >> I thought >> this would be a common issue but since you've never seen this error before, >> not sure what to do 🤔 >> >> On Fri, Jan 27, 2023 at 10:59 PM Gyula Fóra wrote: >> >>> We never encountered

Re: "Error while retrieving the leader gateway" when using Kubernetes HA

2023-01-27 Thread Gyula Fóra
We never encountered this problem before but also we don't configure those settings. Can you simply try: high-availability: kubernetes And remove the other configs? I think that can only cause problems and should not achieve anything :) Gyula On Fri, Jan 27, 2023 at 6:44 PM Anton Ippolitov via

Re: PyFlink job in kubernetes operator

2023-01-25 Thread Gyula Fóra
Did you check the Python example? https://github.com/apache/flink-kubernetes-operator/tree/main/examples/flink-python-example Gyula On Wed, Jan 25, 2023 at 2:54 PM Evgeniy Lyutikov wrote: > Hello > > Is there a way to run PyFlink jobs in k8s with flink kubernetes operator? > And if not, is it p

Job stuck in CREATED state with scheduling failures

2023-01-21 Thread Gyula Fóra
Hi Devs! We noticed a very strange failure scenario a few times recently with the Native Kubernetes integration. The issue is triggered by a heartbeat timeout (a temporary network problem). We observe the following behaviour: === 3 pods (1 JM, 2 TMs), Flink 1.15 (

Re: Kubernetes JobManager and TaskManager minimum/maximum resources

2023-01-21 Thread Gyula Fóra
But of course the actual memory requirement will largely depend on the type of job, state backend, number of task slots, etc. Production TM/JMs usually have much more resources allocated than 2gb/1cpu as you never want to run out of it :) Gyula On Sat, 21 Jan 2023 at 11:17, Gyula Fóra wrote

Re: Kubernetes JobManager and TaskManager minimum/maximum resources

2023-01-21 Thread Gyula Fóra
Hi! I think the examples allocate too many resources by default and we should reduce it in the yamls. 1gb memory and 0.5 cpu should be more than enough; we could probably get away with even less for example purposes. Would you have time to try this out and maybe contribute this improvement? :

Re: DuplicateJobSubmissionException on restart after taskmanagers crash

2023-01-21 Thread Gyula Fóra
Hi Javier, I will try to look into this as I have not personally seen this problem while using the operator. It would be great if you could reach out to me on slack or email directly so we can discuss the issue and get to the bottom of it. Cheers, Gyula On Fri, 20 Jan 2023 at 23:53, Javier Vegas

Re: Flink Kubernetes Operator podTemplate and 'app' pod label bug?

2023-01-19 Thread Gyula Fóra
.java#L43 > > On Thu, Jan 19, 2023 at 1:59 PM Őrhidi Mátyás > wrote: > >> On a side note, we should probably use a qualified label name instead of >> the pretty common app here. WDYT Gyula? >> >> On Thu, Jan 19, 2023 at 1:48 PM Gyula Fóra wrote: >> >
