Re: flink cluster startup time

Gyula Fóra Wed, 30 Mar 2022 22:09:38 -0700

Hi Frank!

Thank you for the interest.


As the others said, the flink-kubernetes-operator will give you quicker
job/cluster startup time together with full support for the application
mode.

Production readiness is always relative. If I had to build a new production
use-case I would not hesitate to use the flink-kubernetes-operator, simply
because the internal architecture is much simpler and it uses the Flink
Java Client APIs to interact with the cluster in a Flink-native manner. For
me as a Java developer it is also much easier to debug/fix any issues that
might come up.

One important thing to note here is that the way the
flink-kubernetes-operator works is much less intrusive on the running
cluster compared to the SpotifyOperator. It internally relies on the Flink
Native Kubernetes integration (including HA) and once the job is submitted
it lets Flink take care of the rest.

I hope you will have some time to test our preview release and share some
feedback!

Cheers,
Gyula



On Thu, Mar 31, 2022 at 4:51 AM Yang Wang <danrtsey...@gmail.com> wrote:

> @Gyula Fóra <gyula.f...@gmail.com> is trying to prepare the preview
> release (0.1) of flink-kubernetes-operator. It is now fully functional for
> application mode.
> You could give it a try and share more feedback with the community.
>
> Release 1.0 aims to be production ready, and we are still missing some
> important pieces (e.g. FlinkSessionJob, SQL jobs, observability
> improvements, etc.).
>
> Best,
> Yang
>
> On Wed, Mar 30, 2022 at 11:40 PM Frank Dekervel <fr...@kapernikov.com> wrote:
>
>> Hello David,
>>
>> Thanks for the information! So the two main takeaways from your email
>> are to:
>>
>>    - Move to something supporting application mode. Is
>>    https://github.com/apache/flink-kubernetes-operator already ready
>>    enough for production deployments?
>>    - Wait for Flink 1.15.
>>
>> thanks!
>> Frank
>>
>>
>> On Mon, Mar 28, 2022 at 9:16 AM David Morávek <d...@apache.org> wrote:
>>
>>> Hi Frank,
>>>
>>> I'm not really familiar with the internal workings of Spotify's
>>> operator, but here are a few general notes:
>>>
>>> - You only need the JM process for the REST API to become available (TMs
>>> can join in asynchronously). I'd personally aim for < 1m for this step; if
>>> it takes longer, it could signal a problem with your infrastructure (e.g.
>>> images taking a long time to pull, incorrect setup of liveness / readiness
>>> probes, not enough resources).
>>>
>>>> The job is packaged as a fat jar, but it is already baked in the docker
>>>> images we use (so technically there would be no need to "submit" it from a
>>>> separate pod).
>>>>
>>>
>>> That's where the application mode comes in. Please note that this might
>>> also be one of the reasons for the previous steps taking too long (as all
>>> pods are pulling an image with your fat jar that might not be cached).
>>>
>>>> Then the application needs to start up and load its state from the
>>>> latest savepoint, which again takes a couple of minutes
>>>>
>>>
>>> This really depends on the state size, the state backend (e.g. a RocksDB
>>> restore might take longer), and object store throughput / rate limits. The
>>> native-savepoint feature that will come out with 1.15 might help to shave
>>> off some time here, as there is no conversion into the state backend
>>> structures.
>>>
>>> Best,
>>> D.
>>>
>>>
>>> On Fri, Mar 25, 2022 at 9:46 AM Frank Dekervel <fr...@kapernikov.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We run Flink using the Spotify Flink Kubernetes operator (job cluster
>>>> mode). Everything works fine, including upgrades and crash recovery. We do
>>>> not run the job manager in HA mode.
>>>>
>>>> One of the problems we have is that upon upgrades (or during testing),
>>>> the Flink cluster takes a very long time to start:
>>>>
>>>>    - First the operator needs to create the cluster (JM+TM), and wait
>>>>    for it to respond to API requests. This already takes a couple of
>>>>    minutes.
>>>>    - Then the operator creates a job-submitter pod that submits the
>>>>    job to the cluster. The job is packaged as a fat jar, but it is already
>>>>    baked in the docker images we use (so technically there would be no
>>>>    need to "submit" it from a separate pod). The submission goes rather
>>>>    fast though (the time between the job submitter seeing the cluster is
>>>>    online and the "hello" log from the main program is <1min).
>>>>    - Then the application needs to start up and load its state from
>>>>    the latest savepoint, which again takes a couple of minutes.
>>>>
>>>> All steps take quite some time, and we are looking to reduce the
>>>> startup time to allow for easier testing but also less downtime during
>>>> upgrades. So I have some questions:
>>>>
>>>>    - I wonder if the situation is the same for all Kubernetes
>>>>    operators. I really need some kind of operator because otherwise I
>>>>    have to set which savepoint to load from myself on every startup.
>>>>    - What cluster startup time is considered to be acceptable / best
>>>>    practice?
>>>>    - If there are other tricks to reduce startup time, I would be very
>>>>    interested in knowing them :-)
>>>>
>>>> There is also a discussion ongoing on running Flink on spot nodes. I
>>>> guess the startup time is relevant there too.
>>>>
>>>> Thanks already
>>>> Frank
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>> --
>> Frank Dekervel
>> +32 473 94 34 21 <+32473943421>
>> www.kapernikov.com <https://kapernikov.com/>
>>
>
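
The thread suggests aiming for under a minute between cluster creation and
the JobManager REST API becoming available. One way to check that number on
your own setup is to time the step directly. The following is only a minimal
sketch, not something taken from either operator: the helper names
(rest_probe, wait_until_ready) and the base URL are assumptions made for
illustration, and it assumes the JobManager REST endpoint is reachable from
where the script runs (for example through a port-forward) and answers
GET /overview once it is up.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def rest_probe(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET <base_url>/overview answers with HTTP 200.

    base_url is an assumption, e.g. "http://localhost:8081" behind a
    port-forward to the JobManager REST service.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/overview", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, non-2xx answer: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe() until it succeeds.

    Returns the elapsed seconds on success, or None if timeout_s passes
    without a successful probe. clock and sleep are injectable so the
    loop can be exercised without real waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical address; adjust to wherever the REST API is exposed.
    elapsed = wait_until_ready(lambda: rest_probe("http://localhost:8081"))
    if elapsed is None:
        print("REST API did not become available within the timeout")
    else:
        print(f"REST API available after {elapsed:.1f}s")
```

Starting the script right before triggering a deployment gives a rough figure
for the first startup phase only; job submission and savepoint restore would
need to be timed separately.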
