@Gyula Fóra <gyula.f...@gmail.com> is preparing the preview release (0.1) of flink-kubernetes-operator. It is now fully functional for application mode. You could give it a try and share your feedback with the community.

Release 1.0 aims to be production ready. We are still missing some important pieces (e.g. FlinkSessionJob, SQL job, observability improvements, etc.).

Best,
Yang

On Wed, Mar 30, 2022 at 23:40, Frank Dekervel <fr...@kapernikov.com> wrote:

> Hello David,
>
> Thanks for the information! So the two main takeaways from your email are
> to
>
>    - Move to something supporting application mode. Is
>    https://github.com/apache/flink-kubernetes-operator already ready
>    enough for production deployments?
>    - Wait for Flink 1.15.
>
> Thanks!
> Frank
>
>
> On Mon, Mar 28, 2022 at 9:16 AM David Morávek <d...@apache.org> wrote:
>
>> Hi Frank,
>>
>> I'm not really familiar with the internal workings of Spotify's
>> operator, but here are a few general notes:
>>
>> - You only need the JM process for the REST API to become available (TMs
>> can join in asynchronously). I'd personally aim for < 1m for this step; if
>> it takes longer, it could signal a problem with your infrastructure (e.g.
>> images taking a long time to pull, incorrect setup of liveness / readiness
>> probes, not enough resources).
>>
>>> The job is packaged as a fat jar, but it is already baked in the docker
>>> images we use (so technically there would be no need to "submit" it from a
>>> separate pod).
>>
>> That's where the application mode comes in. Please note that this might
>> also be one of the reasons for the previous steps taking too long (as all
>> pods are pulling an image with your fat jar that might not be cached).
>>
>>> Then the application needs to start up and load its state from the latest
>>> savepoint, which again takes a couple of minutes
>>
>> This really depends on the state size, the state backend (e.g. a RocksDB
>> restore might take longer), and the object store throughput / rate limit.
>> The native-savepoint feature that will come out with 1.15 might help to
>> shave off some time here, as there is no conversion into the state backend
>> structures.
>>
>> Best,
>> D.
>>
>> On Fri, Mar 25, 2022 at 9:46 AM Frank Dekervel <fr...@kapernikov.com>
>> wrote:
>>
>>> Hello,
>>>
>>> We run Flink using the Spotify Flink Kubernetes operator (job cluster
>>> mode). Everything works fine, including upgrades and crash recovery. We do
>>> not run the job manager in HA mode.
>>>
>>> One of the problems we have is that upon upgrades (or during testing),
>>> the startup of the Flink cluster takes a very long time:
>>>
>>>    - First the operator needs to create the cluster (JM+TM), and wait
>>>    for it to respond to API requests. This already takes a couple of
>>>    minutes.
>>>    - Then the operator creates a job-submitter pod that submits the job
>>>    to the cluster. The job is packaged as a fat jar, but it is already
>>>    baked in the docker images we use (so technically there would be no
>>>    need to "submit" it from a separate pod). The submission goes rather
>>>    fast though (the time between the job submitter seeing the cluster is
>>>    online and the "hello" log from the main program is <1min).
>>>    - Then the application needs to start up and load its state from the
>>>    latest savepoint, which again takes a couple of minutes.
>>>
>>> All steps take quite some time, and we are looking to reduce the startup
>>> time to allow for easier testing but also less downtime during upgrades.
>>> So I have some questions:
>>>
>>>    - I wonder if the situation is the same for all Kubernetes
>>>    operators. I really need some kind of operator because otherwise I
>>>    have to set which savepoint to load from myself on every startup.
>>>    - What cluster startup time is considered to be acceptable / best
>>>    practice?
>>>    - If there are other tricks to reduce startup time, I would be very
>>>    interested in knowing them :-)
>>>
>>> There is also a discussion ongoing on running Flink on spot nodes. I
>>> guess the startup time is relevant there too.
>>>
>>> Thanks already
>>> Frank
>
> --
> Frank Dekervel
> +32 473 94 34 21
> www.kapernikov.com <https://kapernikov.com/>
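
For the first step discussed in the thread (the time until the JobManager REST API answers, for which David suggests aiming at under a minute), a rough way to measure it is sketched here. This is only an illustration under assumptions that do not come from the thread: the function names are made up, the base URL `http://localhost:8081` assumes the JobManager REST port has been made reachable locally (for example through a port-forward), and `/overview` is used as the probe endpoint of the Flink REST API.

```python
import time
import urllib.error
import urllib.request


def rest_api_ready(base_url, timeout=2.0):
    """Return True if GET {base_url}/overview answers with HTTP 200.

    base_url is an assumption, e.g. "http://localhost:8081".
    """
    try:
        with urllib.request.urlopen(f"{base_url}/overview", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, HTTP error: not ready yet.
        return False


def seconds_until_ready(probe, deadline_s=300.0, interval_s=2.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True.

    Returns the elapsed seconds, or None if deadline_s passes first.
    clock and sleep are injectable so the loop can be tested offline.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)


# Hypothetical usage, started right after the deployment is created:
#   elapsed = seconds_until_ready(lambda: rest_api_ready("http://localhost:8081"))
#   print("REST API not ready in time" if elapsed is None else f"ready after {elapsed:.0f}s")
```

Only the JobManager needs to be up for the probe to succeed, so the number it reports covers image pull and JM startup but not TaskManager registration or savepoint restore.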