long lived standalone job session cluster in kubernetes

2018-11-29 Thread Derek VerLee

  
  
I'm looking at the job cluster mode; it looks great, and I am
  considering migrating our jobs off our "legacy" session cluster
  and into Kubernetes.

I do need to ask some questions, because I haven't found many
  details in the documentation about how it works yet, and I gave up
  following the DI around in the code after a while.

Let's say I have a deployment for the job "leader" in HA with ZK,
  and another deployment for the taskmanagers.
I want to upgrade the code or configuration and start from a
  savepoint, in an automated way.

Best I can figure, I cannot just update the deployment resources
  in Kubernetes and allow the containers to restart in an arbitrary
  order.
Instead, I expect sequencing is important, something along the
  lines of this:
1. issue savepoint command on leader
  2. wait for savepoint
  3. destroy all leader and taskmanager containers
  4. deploy new leader, with savepoint url
  5. deploy new taskmanagers
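As a sketch, the sequencing above might look like this in an automation script. All names below (deployment names, job id, savepoint directory, manifest files) are hypothetical placeholders, not anything from this thread:

```python
# Sketch of the five upgrade steps as an ordered list of shell commands.
# The order is what matters: savepoint first, full teardown next,
# new leader before new taskmanagers.

def upgrade_plan(job_id, savepoint_dir):
    """Return the commands for the five steps, in the required order."""
    return [
        # 1 + 2: trigger a savepoint on the leader; the CLI blocks until
        # the savepoint completes and prints its path
        f"flink savepoint {job_id} {savepoint_dir}",
        # 3: tear down both deployments so no old-version taskmanager
        # can attach to the new leader
        "kubectl delete deployment flink-jobmanager flink-taskmanager",
        # 4: new leader resumes from the savepoint path captured in step 2
        "kubectl apply -f jobmanager-v2.yaml",
        # 5: new taskmanagers come up last and attach to the new leader
        "kubectl apply -f taskmanager-v2.yaml",
    ]

for step, cmd in enumerate(upgrade_plan("a1b2c3d4", "s3://bucket/savepoints"), 1):
    print(step, cmd)
```

An orchestrator (CI job, operator, etc.) would execute these in order, parsing the savepoint path out of step 2's output before applying step 4.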



 For example, I imagine old taskmanagers (with an old version of
  my job) attaching to the new leader and causing a problem.
Does that sound right, or am I overthinking it? 

If not, has anyone tried implementing any automation for this
  yet?
  



Re: long lived standalone job session cluster in kubernetes

2018-12-04 Thread Dawid Wysakowicz
Hi Derek,

I am not an expert in kubernetes, so I will cc Till, who should be able
to help you more.

As for automating such a process, I would recommend having a
look at the dA Platform [1], which is built on top of Kubernetes.

Best,

Dawid

[1] https://data-artisans.com/platform-overview

On 30/11/2018 02:10, Derek VerLee wrote:





Re: long lived standalone job session cluster in kubernetes

2018-12-04 Thread Andrey Zagrebin
Hi Derek,

I think your automation steps look good.
Recreating the deployments should not take long, and, as you mention,
this way you can avoid unpredictable old/new version collisions.

Best,
Andrey

> On 4 Dec 2018, at 10:22, Dawid Wysakowicz  wrote:



Re: long lived standalone job session cluster in kubernetes

2018-12-04 Thread Till Rohrmann
Hi Derek,

what I would recommend is triggering the cancel-with-savepoint
command [1]. This will create a savepoint and terminate the job execution.
Next you simply respawn the job cluster, providing it with the savepoint
to resume from.

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
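For an automated pipeline, the same cancel-with-savepoint operation is also exposed over the JobManager's REST API (the CLI form is `flink cancel -s <targetDirectory> <jobId>`). A minimal sketch that only builds the two requests — the host, job id, and target directory below are placeholder assumptions:

```python
import json

# Sketch of cancel-with-savepoint via Flink's REST API; this is what the
# CLI command `flink cancel -s <targetDirectory> <jobId>` drives under the
# hood. Host, job id, and target directory are placeholders.

def savepoint_request(base_url, job_id, target_dir):
    """POST this URL/body to trigger a savepoint that also cancels the
    job; the response carries a 'request-id' used for polling."""
    url = f"{base_url}/jobs/{job_id}/savepoints"
    body = json.dumps({"target-directory": target_dir, "cancel-job": True})
    return url, body

def status_url(base_url, job_id, request_id):
    """GET this URL until the operation completes, then read the savepoint
    path from the response and hand it to the new job cluster."""
    return f"{base_url}/jobs/{job_id}/savepoints/{request_id}"

url, body = savepoint_request("http://flink-jobmanager:8081",
                              "a1b2c3d4", "s3://bucket/savepoints")
print(url)
print(body)
```

The savepoint path returned by the status endpoint is then passed to the new job cluster container as its resume-from argument.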

Cheers,
Till

On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin  wrote:


Re: long lived standalone job session cluster in kubernetes

2018-12-05 Thread Derek VerLee

  
  
Sounds good.
Is someone working on this automation today?
If not, although my time is tight, I may be able to work on a PR
  to get us started down the path of Kubernetes-native cluster
  mode.



On 12/4/18 5:35 AM, Till Rohrmann wrote:



Re: long lived standalone job session cluster in kubernetes

2018-12-05 Thread Till Rohrmann
Hi Derek,

there is this issue [1] which tracks the active Kubernetes integration. Jin
Sun already started implementing some parts of it. There should also be
some PRs open for it. Please check them out.

[1] https://issues.apache.org/jira/browse/FLINK-9953

Cheers,
Till

On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee  wrote:



Re: long lived standalone job session cluster in kubernetes

2019-02-08 Thread Heath Albritton
Has any progress been made on this?  There are a number of folks in
the community looking to help out.


-H

On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann  wrote:


Re: long lived standalone job session cluster in kubernetes

2019-02-11 Thread Till Rohrmann
Hi Heath,

I just learned that people from Alibaba already made some good progress
with FLINK-9953. I'm currently talking to them in order to see how we can
merge this contribution into Flink as fast as possible. Since I'm quite
busy due to the upcoming release I hope that other community members will
help out with the reviewing once the PRs are opened.

Cheers,
Till

On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton  wrote:



Re: long lived standalone job session cluster in kubernetes

2019-02-14 Thread Heath Albritton
My team and I are keen to help out with testing and review as soon as there
is a pull request.

-H

> On Feb 11, 2019, at 00:26, Till Rohrmann  wrote:


Re: long lived standalone job session cluster in kubernetes

2019-02-15 Thread Till Rohrmann
Alright, I'll get back to you once the PRs are open. Thanks a lot for your
help :-)

Cheers,
Till

On Thu, Feb 14, 2019 at 5:45 PM Heath Albritton  wrote:



Re: long lived standalone job session cluster in kubernetes

2019-02-26 Thread Chunhui Shi
Hi Heath and Till, thanks for offering to help review this feature. I
just reassigned the JIRAs to myself after an offline discussion with Jin. Let
us work together to get Kubernetes integrated natively with Flink. Thanks.

On Fri, Feb 15, 2019 at 12:19 AM Till Rohrmann  wrote:

> Alright, I'll get back to you once the PRs are open. Thanks a lot for your
> help :-)
>
> Cheers,
> Till
>
> On Thu, Feb 14, 2019 at 5:45 PM Heath Albritton  wrote:
>
>> My team and I are keen to help out with testing and review as soon as
>> there is a pill request.
>>
>> -H
>>
>> On Feb 11, 2019, at 00:26, Till Rohrmann  wrote:
>>
>> Hi Heath,
>>
>> I just learned that people from Alibaba already made some good progress
>> with FLINK-9953. I'm currently talking to them in order to see how we can
>> merge this contribution into Flink as fast as possible. Since I'm quite
>> busy due to the upcoming release I hope that other community members will
>> help out with the reviewing once the PRs are opened.
>>
>> Cheers,
>> Till
>>
>> On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton  wrote:
>>
>>> Has any progress been made on this?  There are a number of folks in
>>> the community looking to help out.
>>>
>>>
>>> -H
>>>
>>> On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann 
>>> wrote:
>>> >
>>> > Hi Derek,
>>> >
>>> > there is this issue [1] which tracks the active Kubernetes
>>> integration. Jin Sun already started implementing some parts of it. There
>>> should also be some PRs open for it. Please check them out.
>>> >
>>> > [1] https://issues.apache.org/jira/browse/FLINK-9953
>>> >
>>> > Cheers,
>>> > Till
>>> >
>>> > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee 
>>> wrote:
>>> >>
>>> >> Sounds good.
>>> >>
>>> >> Is someone working on this automation today?
>>> >>
>>> >> If not, although my time is tight, I may be able to work on a PR for
>>> getting us started down the path Kubernetes native cluster mode.
>>> >>
>>> >>
>>> >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>>> >>
>>> >> Hi Derek,
>>> >>
>>> >> what I would recommend to use is to trigger the cancel with savepoint
>>> command [1]. This will create a savepoint and terminate the job execution.
>>> Next you simply need to respawn the job cluster which you provide with the
>>> savepoint to resume from.
>>> >>
>>> >> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
>>> >>
>>> >> Cheers,
>>> >> Till
>>> >>
>>> >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin <
>>> and...@data-artisans.com> wrote:
>>> >>>
>>> >>> Hi Derek,
>>> >>>
>>> >>> I think your automation steps look good.
>>> >>> Recreating deployments should not take long
>>> >>> and as you mention, this way you can avoid unpredictable old/new
>>> version collisions.
>>> >>>
>>> >>> Best,
>>> >>> Andrey
>>> >>>
>>> >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz 
>>> wrote:
>>> >>> >
>>> >>> > Hi Derek,
>>> >>> >
>>> >>> > I am not an expert in kubernetes, so I will cc Till, who should be
>>> able
>>> >>> > to help you more.
>>> >>> >
>>> >>> > As for the automation for similar process I would recommend having
>>> a
>>> >>> > look at dA platform[1] which is built on top of kubernetes.
>>> >>> >
>>> >>> > Best,
>>> >>> >
>>> >>> > Dawid
>>> >>> >
>>> >>> > [1] https://data-artisans.com/platform-overview
>>> >>> >
>>> >>> > On 30/11/2018 02:10, Derek VerLee wrote:
>>> >>> >>
>>> >>> >> I'm looking at the job cluster mode, it looks great and I am
>>> >>> >> considering migrating our jobs off our "legacy" session cluster
>>> and
>>> >>> >> into Kubernetes.
>>> >>> >>
>>> >>> >> I do need to ask some questions because I haven't found a lot of
>>> >>> >> details in the documentation about how it works yet, and I gave up
>>> >>> >> following the DI around in the code after a while.
>>> >>> >>
>>> >>> >> Let's say I have a deployment for the job "leader" in HA with ZK,
>>> and
>>> >>> >> another deployment for the taskmanagers.
>>> >>> >>
>>> >>> >> I want to upgrade the code or configuration and start from a
>>> >>> >> savepoint, in an automated way.
>>> >>> >>
>>> >>> >> Best I can figure, I can not just update the deployment resources
>>> in
>>> >>> >> kubernetes and allow the containers to restart in an arbitrary
>>> order.
>>> >>> >>
>>> >>> >> Instead, I expect sequencing is important, something along the
>>> lines
>>> >>> >> of this:
>>> >>> >>
>>> >>> >> 1. issue savepoint command on leader
>>> >>> >> 2. wait for savepoint
>>> >>> >> 3. destroy all leader and taskmanager containers
>>> >>> >> 4. deploy new leader, with savepoint url
>>> >>> >> 5. deploy new taskmanagers
>>> >>> >>
>>> >>> >>
>>> >>> >> For example, I imagine old taskmanagers (with an old version of my
>>> >>> >> job) attaching to the new leader and causing a problem.
>>> >>> >>
>>> >>> >> Does that sound right, or am I overthinking it?
>>> >>> >>
>>> >>> >> If not, has anyone tried implementing any automation for this yet?
>>> >>> >>
>>> >>> >
>>> >>>
>>>
>>
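
Till's cancel-with-savepoint recommendation covers taking the snapshot; resuming then means the respawned job cluster's leader must be handed the savepoint path. A hedged sketch of what that manifest fragment might look like follows: the image, job class, and savepoint path are placeholders, and the `--fromSavepoint` flag should be verified against your Flink version's standalone job entrypoint before relying on it.

```yaml
# Hypothetical fragment of leader-deployment.yaml for the respawned
# job cluster: the container args hand the savepoint to the leader.
containers:
  - name: flink-leader
    image: my-registry/my-flink-job:v2           # new job version
    args:
      - "job-cluster"
      - "--job-classname"
      - "com.example.MyJob"                      # your job's main class
      - "--fromSavepoint"
      - "s3://my-bucket/savepoints/savepoint-ab12cd"  # from the cancel step
```

Templating this path into the manifest at deploy time is what ties step 4 of the automation to the savepoint produced in steps 1-2.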


Re: long lived standalone job session cluster in kubernetes

2019-02-27 Thread Heath Albritton
Great, my team is eager to get started.  I’m curious what progress has been 
made so far?

-H

> On Feb 26, 2019, at 14:43, Chunhui Shi  wrote:
> 
> Hi Heath and Till, thanks for offering help on reviewing this feature. I just 
> reassigned the JIRAs to myself after offline discussion with Jin. Let us work 
> together to get kubernetes integrated natively with flink. Thanks.
> 
>> On Fri, Feb 15, 2019 at 12:19 AM Till Rohrmann  wrote:
>> Alright, I'll get back to you once the PRs are open. Thanks a lot for your 
>> help :-)
>> 
>> Cheers,
>> Till
>> 
>>> On Thu, Feb 14, 2019 at 5:45 PM Heath Albritton  wrote:
>>> My team and I are keen to help out with testing and review as soon as there 
>>> is a pull request.
>>> 
>>> -H
>>> 
 On Feb 11, 2019, at 00:26, Till Rohrmann  wrote:
 
 Hi Heath,
 
 I just learned that people from Alibaba already made some good progress 
 with FLINK-9953. I'm currently talking to them in order to see how we can 
 merge this contribution into Flink as fast as possible. Since I'm quite 
 busy due to the upcoming release I hope that other community members will 
 help out with the reviewing once the PRs are opened.
 
 Cheers,
 Till
 
> On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton  wrote:
> Has any progress been made on this?  There are a number of folks in
> the community looking to help out.
> 
> 
> -H
> 
> On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann  
> wrote:
> >
> > Hi Derek,
> >
> > there is this issue [1] which tracks the active Kubernetes integration. 
> > Jin Sun already started implementing some parts of it. There should 
> > also be some PRs open for it. Please check them out.
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-9953
> >
> > Cheers,
> > Till
> >
> > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee  
> > wrote:
> >>
> >> Sounds good.
> >>
> >> Is someone working on this automation today?
> >>
> >> If not, although my time is tight, I may be able to work on a PR for 
> >> getting us started down the path Kubernetes native cluster mode.
> >>
> >>
> >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
> >>
> >> Hi Derek,
> >>
> >> what I would recommend to use is to trigger the cancel with savepoint 
> >> command [1]. This will create a savepoint and terminate the job 
> >> execution. Next you simply need to respawn the job cluster which you 
> >> provide with the savepoint to resume from.
> >>
> >> [1] 
> >> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
> >>
> >> Cheers,
> >> Till
> >>
> >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin 
> >>  wrote:
> >>>
> >>> Hi Derek,
> >>>
> >>> I think your automation steps look good.
> >>> Recreating deployments should not take long
> >>> and as you mention, this way you can avoid unpredictable old/new 
> >>> version collisions.
> >>>
> >>> Best,
> >>> Andrey
> >>>
> >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz  
> >>> > wrote:
> >>> >
> >>> > Hi Derek,
> >>> >
> >>> > I am not an expert in kubernetes, so I will cc Till, who should be 
> >>> > able
> >>> > to help you more.
> >>> >
> >>> > As for the automation for similar process I would recommend having a
> >>> > look at dA platform[1] which is built on top of kubernetes.
> >>> >
> >>> > Best,
> >>> >
> >>> > Dawid
> >>> >
> >>> > [1] https://data-artisans.com/platform-overview
> >>> >
> >>> > On 30/11/2018 02:10, Derek VerLee wrote:
> >>> >>
> >>> >> I'm looking at the job cluster mode, it looks great and I am
> >>> >> considering migrating our jobs off our "legacy" session cluster and
> >>> >> into Kubernetes.
> >>> >>
> >>> >> I do need to ask some questions because I haven't found a lot of
> >>> >> details in the documentation about how it works yet, and I gave up
> >>> >> following the DI around in the code after a while.
> >>> >>
> >>> >> Let's say I have a deployment for the job "leader" in HA with ZK, 
> >>> >> and
> >>> >> another deployment for the taskmanagers.
> >>> >>
> >>> >> I want to upgrade the code or configuration and start from a
> >>> >> savepoint, in an automated way.
> >>> >>
> >>> >> Best I can figure, I can not just update the deployment resources 
> >>> >> in
> >>> >> kubernetes and allow the containers to restart in an arbitrary 
> >>> >> order.
> >>> >>
> >>> >> Instead, I expect sequencing is important, something along the 
> >>> >> lines
> >>> >> of this:
> >>> >>
> >>> >> 1. issue savepoint command on leader
> >>> >> 2. wait for savepoint
> >>> >> 3. destroy all leader and taskmanager containers
> >>

Re: long lived standalone job session cluster in kubernetes

2019-04-02 Thread Till Rohrmann
Hi Heath,

I think some of the PRs are already open and ready for review [1, 2].

[1] https://issues.apache.org/jira/browse/FLINK-10932
[2] https://issues.apache.org/jira/browse/FLINK-10935

Cheers,
Till

On Wed, Feb 27, 2019 at 10:48 AM Heath Albritton  wrote:

> Great, my team is eager to get started.  I’m curious what progress has
> been made so far?
>
> -H
>
> On Feb 26, 2019, at 14:43, Chunhui Shi  wrote:
>
> Hi Heath and Till, thanks for offering help on reviewing this feature. I
> just reassigned the JIRAs to myself after offline discussion with Jin. Let
> us work together to get kubernetes integrated natively with flink. Thanks.
>
> On Fri, Feb 15, 2019 at 12:19 AM Till Rohrmann 
> wrote:
>
>> Alright, I'll get back to you once the PRs are open. Thanks a lot for
>> your help :-)
>>
>> Cheers,
>> Till
>>
>> On Thu, Feb 14, 2019 at 5:45 PM Heath Albritton 
>> wrote:
>>
>>> My team and I are keen to help out with testing and review as soon as
>>> there is a pull request.
>>>
>>> -H
>>>
>>> On Feb 11, 2019, at 00:26, Till Rohrmann  wrote:
>>>
>>> Hi Heath,
>>>
>>> I just learned that people from Alibaba already made some good progress
>>> with FLINK-9953. I'm currently talking to them in order to see how we can
>>> merge this contribution into Flink as fast as possible. Since I'm quite
>>> busy due to the upcoming release I hope that other community members will
>>> help out with the reviewing once the PRs are opened.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton 
>>> wrote:
>>>
 Has any progress been made on this?  There are a number of folks in
 the community looking to help out.


 -H

 On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann 
 wrote:
 >
 > Hi Derek,
 >
 > there is this issue [1] which tracks the active Kubernetes
 integration. Jin Sun already started implementing some parts of it. There
 should also be some PRs open for it. Please check them out.
 >
 > [1] https://issues.apache.org/jira/browse/FLINK-9953
 >
 > Cheers,
 > Till
 >
 > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee 
 wrote:
 >>
 >> Sounds good.
 >>
 >> Is someone working on this automation today?
 >>
 >> If not, although my time is tight, I may be able to work on a PR for
 getting us started down the path Kubernetes native cluster mode.
 >>
 >>
 >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
 >>
 >> Hi Derek,
 >>
 >> what I would recommend to use is to trigger the cancel with
 savepoint command [1]. This will create a savepoint and terminate the job
 execution. Next you simply need to respawn the job cluster which you
 provide with the savepoint to resume from.
 >>
 >> [1]
 https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
 >>
 >> Cheers,
 >> Till
 >>
 >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin <
 and...@data-artisans.com> wrote:
 >>>
 >>> Hi Derek,
 >>>
 >>> I think your automation steps look good.
 >>> Recreating deployments should not take long
 >>> and as you mention, this way you can avoid unpredictable old/new
 version collisions.
 >>>
 >>> Best,
 >>> Andrey
 >>>
 >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz 
 wrote:
 >>> >
 >>> > Hi Derek,
 >>> >
 >>> > I am not an expert in kubernetes, so I will cc Till, who should
 be able
 >>> > to help you more.
 >>> >
 >>> > As for the automation for similar process I would recommend
 having a
 >>> > look at dA platform[1] which is built on top of kubernetes.
 >>> >
 >>> > Best,
 >>> >
 >>> > Dawid
 >>> >
 >>> > [1] https://data-artisans.com/platform-overview
 >>> >
 >>> > On 30/11/2018 02:10, Derek VerLee wrote:
 >>> >>
 >>> >> I'm looking at the job cluster mode, it looks great and I am
 >>> >> considering migrating our jobs off our "legacy" session cluster
 and
 >>> >> into Kubernetes.
 >>> >>
 >>> >> I do need to ask some questions because I haven't found a lot of
 >>> >> details in the documentation about how it works yet, and I gave
 up
 >>> >> following the DI around in the code after a while.
 >>> >>
 >>> >> Let's say I have a deployment for the job "leader" in HA with
 ZK, and
 >>> >> another deployment for the taskmanagers.
 >>> >>
 >>> >> I want to upgrade the code or configuration and start from a
 >>> >> savepoint, in an automated way.
 >>> >>
 >>> >> Best I can figure, I can not just update the deployment
 resources in
 >>> >> kubernetes and allow the containers to restart in an arbitrary
 order.
 >>> >>
 >>> >> Instead, I expect sequencing is important, something along the
 lines
 >>> >> of this:
 >>> >>
 >>> >> 1. issue savepoint command on leader
>>