Sounds good.

Is someone working on this automation today?

If not, although my time is tight, I may be able to work on a PR to get us started down the path toward Kubernetes-native cluster mode.


On 12/4/18 5:35 AM, Till Rohrmann wrote:
Hi Derek,

What I would recommend is to trigger the cancel-with-savepoint command [1]. This will create a savepoint and terminate the job execution. Next, you simply respawn the job cluster and provide it with the savepoint to resume from.
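A minimal sketch of those two steps with the Flink CLI; the job ID, the savepoint target directory, and the entrypoint flag are placeholders to adapt to your setup and Flink version:

```shell
# Create a savepoint and terminate the job in one step.
# The CLI blocks until the savepoint completes and prints its path,
# e.g. s3://my-bucket/savepoints/savepoint-XXXX
flink cancel -s s3://my-bucket/savepoints <job-id>

# When respawning the job cluster, hand that path to the entrypoint
# so the job resumes from it (flag name may differ by version):
#   --fromSavepoint s3://my-bucket/savepoints/savepoint-XXXX
```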


Cheers,
Till

On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin <and...@data-artisans.com> wrote:
Hi Derek,

I think your automation steps look good.
Recreating deployments should not take long,
and, as you mention, this way you can avoid unpredictable old/new version collisions.

Best,
Andrey

> On 4 Dec 2018, at 10:22, Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>
> Hi Derek,
>
> I am not an expert in kubernetes, so I will cc Till, who should be able
> to help you more.
>
> As for the automation for similar process I would recommend having a
> look at dA platform[1] which is built on top of kubernetes.
>
> Best,
>
> Dawid
>
> [1] https://data-artisans.com/platform-overview
>
> On 30/11/2018 02:10, Derek VerLee wrote:
>>
>> I'm looking at the job cluster mode; it looks great, and I am
>> considering migrating our jobs off our "legacy" session cluster and
>> into Kubernetes.
>>
>> I do need to ask some questions because I haven't found a lot of
>> details in the documentation about how it works yet, and I gave up
>> following the DI around in the code after a while.
>>
>> Let's say I have a deployment for the job "leader" in HA with ZK, and
>> another deployment for the taskmanagers.
>>
>> I want to upgrade the code or configuration and start from a
>> savepoint, in an automated way.
>>
>> Best I can figure, I cannot just update the deployment resources in
>> kubernetes and allow the containers to restart in an arbitrary order.
>>
>> Instead, I expect sequencing is important, something along these
>> lines:
>>
>> 1. issue savepoint command on leader
>> 2. wait for savepoint
>> 3. destroy all leader and taskmanager containers
>> 4. deploy new leader, with savepoint url
>> 5. deploy new taskmanagers
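A sketch of that five-step sequence as a script; the deployment names, manifest files, and savepoint target are all hypothetical, and the flink/kubectl invocations should be verified against your versions before relying on them:

```shell
#!/usr/bin/env sh
set -eu

JOB_ID="$1"  # the running job to upgrade

# Steps 1+2: trigger the savepoint and cancel the job; the CLI blocks
# until the savepoint completes and prints its path
SAVEPOINT=$(flink cancel -s s3://my-bucket/savepoints "$JOB_ID" \
            | grep -o 's3://[^ ]*')

# Step 3: destroy leader and taskmanager deployments so no old
# taskmanager can attach to the new leader
kubectl delete deployment flink-job-leader flink-taskmanager

# Step 4: deploy the new leader with the savepoint path substituted
# into the manifest (e.g. as an entrypoint argument)
sed "s|{{SAVEPOINT}}|$SAVEPOINT|" leader-deployment.yaml | kubectl apply -f -

# Step 5: deploy the new taskmanagers once the leader is up
kubectl apply -f taskmanager-deployment.yaml
```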
>>
>>
>> For example, I imagine old taskmanagers (with an old version of my
>> job) attaching to the new leader and causing a problem.
>>
>> Does that sound right, or am I overthinking it?
>>
>> If not, has anyone tried implementing any automation for this yet?
>>
>
