Re: long lived standalone job session cluster in kubernetes

Heath Albritton Thu, 14 Feb 2019 08:45:21 -0800

My team and I are keen to help out with testing and review as soon as there is 
a pull request.


-H

> On Feb 11, 2019, at 00:26, Till Rohrmann <trohrm...@apache.org> wrote:
> 
> Hi Heath,
> 
> I just learned that people from Alibaba already made some good progress with 
> FLINK-9953. I'm currently talking to them in order to see how we can merge 
> this contribution into Flink as fast as possible. Since I'm quite busy due to 
> the upcoming release I hope that other community members will help out with 
> the reviewing once the PRs are opened.
> 
> Cheers,
> Till
> 
>> On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton <halbr...@harm.org> wrote:
>> Has any progress been made on this?  There are a number of folks in
>> the community looking to help out.
>> 
>> 
>> -H
>> 
>> On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann <trohrm...@apache.org> wrote:
>> >
>> > Hi Derek,
>> >
>> > there is this issue [1] which tracks the active Kubernetes integration. 
>> > Jin Sun already started implementing some parts of it. There should also 
>> > be some PRs open for it. Please check them out.
>> >
>> > [1] https://issues.apache.org/jira/browse/FLINK-9953
>> >
>> > Cheers,
>> > Till
>> >
>> > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee <derekver...@gmail.com> wrote:
>> >>
>> >> Sounds good.
>> >>
>> >> Is someone working on this automation today?
>> >>
>> >> If not, although my time is tight, I may be able to work on a PR for 
>> >> getting us started down the path to Kubernetes native cluster mode.
>> >>
>> >>
>> >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>> >>
>> >> Hi Derek,
>> >>
>> >> what I would recommend is to trigger the cancel with savepoint 
>> >> command [1]. This will create a savepoint and terminate the job 
>> >> execution. Next you simply need to respawn the job cluster, providing 
>> >> it with the savepoint to resume from.
>> >>
>> >> [1] 
>> >> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
>> >>
>> >> Cheers,
>> >> Till
>> >>
>> >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin 
>> >> <and...@data-artisans.com> wrote:
>> >>>
>> >>> Hi Derek,
>> >>>
>> >>> I think your automation steps look good.
>> >>> Recreating deployments should not take long
>> >>> and as you mention, this way you can avoid unpredictable old/new version 
>> >>> collisions.
>> >>>
>> >>> Best,
>> >>> Andrey
>> >>>
>> >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz <dwysakow...@apache.org> 
>> >>> > wrote:
>> >>> >
>> >>> > Hi Derek,
>> >>> >
>> >>> > I am not an expert in kubernetes, so I will cc Till, who should be able
>> >>> > to help you more.
>> >>> >
>> >>> > As for automating a similar process, I would recommend having a
>> >>> > look at dA platform[1], which is built on top of kubernetes.
>> >>> >
>> >>> > Best,
>> >>> >
>> >>> > Dawid
>> >>> >
>> >>> > [1] https://data-artisans.com/platform-overview
>> >>> >
>> >>> > On 30/11/2018 02:10, Derek VerLee wrote:
>> >>> >>
>> >>> >> I'm looking at the job cluster mode. It looks great, and I am
>> >>> >> considering migrating our jobs off our "legacy" session cluster and
>> >>> >> into Kubernetes.
>> >>> >>
>> >>> >> I do need to ask some questions because I haven't found a lot of
>> >>> >> details in the documentation about how it works yet, and I gave up
>> >>> >> following the DI around in the code after a while.
>> >>> >>
>> >>> >> Let's say I have a deployment for the job "leader" in HA with ZK, and
>> >>> >> another deployment for the taskmanagers.
>> >>> >>
>> >>> >> I want to upgrade the code or configuration and start from a
>> >>> >> savepoint, in an automated way.
>> >>> >>
>> >>> >> Best I can figure, I cannot just update the deployment resources in
>> >>> >> kubernetes and allow the containers to restart in an arbitrary order.
>> >>> >>
>> >>> >> Instead, I expect sequencing is important, something along the lines
>> >>> >> of this:
>> >>> >>
>> >>> >> 1. issue savepoint command on leader
>> >>> >> 2. wait for savepoint
>> >>> >> 3. destroy all leader and taskmanager containers
>> >>> >> 4. deploy new leader, with savepoint url
>> >>> >> 5. deploy new taskmanagers
>> >>> >>
>> >>> >>
>> >>> >> For example, I imagine old taskmanagers (with an old version of my
>> >>> >> job) attaching to the new leader and causing a problem.
>> >>> >>
>> >>> >> Does that sound right, or am I overthinking it?
>> >>> >>
>> >>> >> If not, has anyone tried implementing any automation for this yet?
>> >>> >>
>> >>> >
>> >>>
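
For reference, the five-step sequence Derek lists and the cancel with savepoint command Till recommends could be sketched roughly as below. This is only an illustration under stated assumptions, not something posted in the thread: the function names and the four callables are hypothetical placeholders for whatever tooling actually talks to Flink and Kubernetes, and the only real CLI syntax used is the documented `flink cancel -s [targetDirectory] <jobID>` form.

```python
from typing import Callable, List


def cancel_with_savepoint_cmd(job_id: str, target_dir: str) -> List[str]:
    """Build the argument list for Flink's 'cancel with savepoint' action.

    Assumes the `flink` binary is on PATH; job_id and target_dir are
    supplied by the caller.
    """
    return ["flink", "cancel", "-s", target_dir, job_id]


def upgrade_job_cluster(
    take_savepoint: Callable[[], str],
    destroy_cluster: Callable[[], None],
    deploy_leader: Callable[[str], None],
    deploy_taskmanagers: Callable[[], None],
) -> str:
    """Run the upgrade steps strictly in order (all callables are hypothetical)."""
    # Steps 1-2: trigger the savepoint and block until its path is known.
    savepoint_path = take_savepoint()
    if not savepoint_path:
        # Without a completed savepoint, do not touch the running cluster.
        raise RuntimeError("savepoint did not complete; old cluster left as-is")

    # Step 3: remove every old leader and taskmanager container, so that no
    # old taskmanager can attach to the new leader.
    destroy_cluster()

    # Step 4: start the new leader, pointing it at the savepoint.
    deploy_leader(savepoint_path)

    # Step 5: start the new taskmanagers.
    deploy_taskmanagers()
    return savepoint_path
```

Passing the steps in as callables keeps the ordering logic separate from the cluster-specific details, so the sequence can be exercised with stubs before it is wired to a real deployment.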
