Sounds good.

Is someone working on this automation today?

If not, although my time is tight, I may be able to work on a PR to get us started down the path toward Kubernetes-native cluster mode.


On 12/4/18 5:35 AM, Till Rohrmann wrote:
Hi Derek,

What I would recommend is to trigger the cancel-with-savepoint command [1]. This will create a savepoint and terminate the job execution. Next, you simply respawn the job cluster and provide it with the savepoint to resume from.
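A minimal sketch of those two steps with the Flink CLI; the job ID, the savepoint target directory, and the entrypoint flag are placeholders to adapt to your setup and Flink version:

```shell
# Create a savepoint and terminate the job in one step.
# The CLI blocks until the savepoint completes and prints its path,
# e.g. s3://my-bucket/savepoints/savepoint-XXXX
flink cancel -s s3://my-bucket/savepoints <job-id>

# When respawning the job cluster, hand that path to the entrypoint
# so the job resumes from it (flag name may differ by version):
#   --fromSavepoint s3://my-bucket/savepoints/savepoint-XXXX
```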


Cheers,
Till

On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin <and...@data-artisans.com> wrote:
Hi Derek,

I think your automation steps look good.
Recreating deployments should not take long,
and, as you mention, this way you can avoid unpredictable old/new version collisions.

Best,
Andrey

> On 4 Dec 2018, at 10:22, Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>
> Hi Derek,
>
> I am not an expert in kubernetes, so I will cc Till, who should be able
> to help you more.
>
> As for the automation for similar process I would recommend having a
> look at dA platform[1] which is built on top of kubernetes.
>
> Best,
>
> Dawid
>
> [1] https://data-artisans.com/platform-overview
>
> On 30/11/2018 02:10, Derek VerLee wrote:
>>
>> I'm looking at the job cluster mode; it looks great, and I am
>> considering migrating our jobs off our "legacy" session cluster and
>> into Kubernetes.
>>
>> I do need to ask some questions because I haven't found a lot of
>> details in the documentation about how it works yet, and I gave up
>> following the DI around in the code after a while.
>>
>> Let's say I have a deployment for the job "leader" in HA with ZK, and
>> another deployment for the taskmanagers.
>>
>> I want to upgrade the code or configuration and start from a
>> savepoint, in an automated way.
>>
>> Best I can figure, I cannot just update the deployment resources in
>> kubernetes and allow the containers to restart in an arbitrary order.
>>
>> Instead, I expect sequencing is important, something along these
>> lines:
>>
>> 1. issue savepoint command on leader
>> 2. wait for savepoint
>> 3. destroy all leader and taskmanager containers
>> 4. deploy new leader, with savepoint url
>> 5. deploy new taskmanagers
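A sketch of that five-step sequence as a script; the deployment names, manifest files, and savepoint target are all hypothetical, and the flink/kubectl invocations should be verified against your versions before relying on them:

```shell
#!/usr/bin/env sh
set -eu

JOB_ID="$1"  # the running job to upgrade

# Steps 1+2: trigger the savepoint and cancel the job; the CLI blocks
# until the savepoint completes and prints its path
SAVEPOINT=$(flink cancel -s s3://my-bucket/savepoints "$JOB_ID" \
            | grep -o 's3://[^ ]*')

# Step 3: destroy leader and taskmanager deployments so no old
# taskmanager can attach to the new leader
kubectl delete deployment flink-job-leader flink-taskmanager

# Step 4: deploy the new leader with the savepoint path substituted
# into the manifest (e.g. as an entrypoint argument)
sed "s|{{SAVEPOINT}}|$SAVEPOINT|" leader-deployment.yaml | kubectl apply -f -

# Step 5: deploy the new taskmanagers once the leader is up
kubectl apply -f taskmanager-deployment.yaml
```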
>>
>>
>> For example, I imagine old taskmanagers (with an old version of my
>> job) attaching to the new leader and causing a problem.
>>
>> Does that sound right, or am I overthinking it?
>>
>> If not, has anyone tried implementing any automation for this yet?
>>
>
