subject:"Re\: \[DISCUSS\] FLIP\-211\: Kerberos delegation token framework"

Sorry I didn't want to offend anybody if it was perceived like this. I can
see that me joining very late into the discussion w/o constructive ideas
was not nice. My motivation for asking for the reasoning behind the current
design proposal is primarily the lack of Kerberos knowledge. Moreover, it
happened before that we moved responsibilities into Flink that we regretted
later.

As I've said, I don't have a better idea right now. If we believe that it
is the right thing to make Flink responsible for distributing the tokens
and we don't find a better solution then we'll go for it. I just wanted to
make sure that we don't overlook an alternative solution that might be
easier to maintain in the long run.

Cheers,
Till

On Thu, Feb 3, 2022 at 7:52 PM Gyula Fóra  wrote:

> Hi Team!
>
> Let's all calm down a little and not let our emotions affect the discussion
> too much.
> There has been a lot of effort spent from all involved parties so this is
> quite understandable :)
>
> Even though not everyone said this explicitly, it seems that everyone more
> or less agrees that a feature implementing token renewal is necessary and
> valuable.
>
> The main point of contention is: where should the token renewal
> logic run and how to get the tokens to wherever needed.
>
> From my perspective the current design is very reasonable at first sight
> because:
>  1. It runs the token renewal in a single place avoiding extra CDC workload
>  2. Does not introduce new processes, extra communication channels etc but
> piggybacks on existing robust mechanisms.
>
> I understand the concerns about adding new things in the resource manager
> but I think that really depends on how we look at it.
> We cannot reasonably expect a custom token renewal process to have it's own
> secure distribution logic like Flink has now, that is a complete overkill.
> This practically means that we will not have a slim efficient
> implementation for this but something unnecessarily complex. And the only
> thing we get in return is a bit less code in the resource manager.
>
> From a logical standpoint the delegation framework needs to run at a
> centralized place and need to be able to access new task manager processes
> to achieve all it's design goals.
> We can drop a single renewer as a design goal but that might be a decision
> that can affect large scale production runs.
>
> Cheers,
> Gyula
>
>
>
>
> On Thu, Feb 3, 2022 at 7:32 PM Chesnay Schepler 
> wrote:
>
> > First of, at no point have we questioned the use-case and importance of
> > this feature, and the fact that David, Till and me spent time looking at
> > the FLIP, asking questions, and discussing different aspects of it
> > should make this obvious.
> >
> > I'd appreciate it if you didn't dismiss our replies that quickly.
> >
> >  > Ok, so we declare that users who try to use delegation tokens in
> > Flink is dead end code and not supported, right?
> >
> > No one has said that. Are you claiming that your design is the /only
> > possible implementation/ that is capable of achieving the stated goals,
> > that there are 0 alternatives? On of the *main**points* of these
> > discussion threads is to discover alternative implementations that maybe
> > weren't thought of. Yes, that may imply that we amend your design, or
> > reject it completely and come up with a new one.
> >
> >
> > Let's clarify what (I think) Till proposed to get the imagination juice
> > flowing.
> >
> > At the end of the day, all we need is a way to provide Flink processes
> > with a token that can be periodically updated. _Who_ issues that token
> > is irrelevant for the functionality to work. You are proposing for a new
> > component in the Flink RM to do that; Till is proposing to have some
> > external process do it. *That's it*.
> >
> > How this could look like in practice is fairly straight forwad; add a
> > pluggable interface (aka, your TokenProvider thing) that is loaded in
> > each process, which can _somehow_ provide tokens that are then set in
> > the UserGroupInformation.
> > _How_ the provider receives token is up to the provider. It _may_ just
> > talk directly to Kerberos, or it could use some communication channel to
> > accept tokens from the outside.
> > This would for example make it a lot easier to properly integrate this
> > into the lifecycle of the process, as we'd sidestep the whole "TM is
> > running but still needs a Token" issue; it could become a proper setup
> > step of the process that is independent from other Flink processes.
> >
> > /Discuss/.
> >
> > On 03/02/2022 18:57, Gabor Somogyi wrote:
> > >> And even
> > > if we do it like this, there is no guarantee that it works because
> there
> > > can be other applications bombing the KDC with requests.
> > >
> > > 1. The main issue to solve here is that workloads using delegation
> tokens
> > > are stopping after 7 days with default configuration.
> > > 2. This is not new design, it's rock stable and performing well in
> Spark
> > > for years.

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-02-03 Thread Gyula Fóra

Hi Team!

Let's all calm down a little and not let our emotions affect the discussion
too much.
There has been a lot of effort spent from all involved parties so this is
quite understandable :)

Even though not everyone said this explicitly, it seems that everyone more
or less agrees that a feature implementing token renewal is necessary and
valuable.

The main point of contention is: where should the token renewal
logic run and how to get the tokens to wherever needed.

>From my perspective the current design is very reasonable at first sight
because:
 1. It runs the token renewal in a single place avoiding extra CDC workload
 2. Does not introduce new processes, extra communication channels etc but
piggybacks on existing robust mechanisms.

I understand the concerns about adding new things in the resource manager
but I think that really depends on how we look at it.
We cannot reasonably expect a custom token renewal process to have it's own
secure distribution logic like Flink has now, that is a complete overkill.
This practically means that we will not have a slim efficient
implementation for this but something unnecessarily complex. And the only
thing we get in return is a bit less code in the resource manager.

>From a logical standpoint the delegation framework needs to run at a
centralized place and need to be able to access new task manager processes
to achieve all it's design goals.
We can drop a single renewer as a design goal but that might be a decision
that can affect large scale production runs.

Cheers,
Gyula




On Thu, Feb 3, 2022 at 7:32 PM Chesnay Schepler  wrote:

> First of, at no point have we questioned the use-case and importance of
> this feature, and the fact that David, Till and me spent time looking at
> the FLIP, asking questions, and discussing different aspects of it
> should make this obvious.
>
> I'd appreciate it if you didn't dismiss our replies that quickly.
>
>  > Ok, so we declare that users who try to use delegation tokens in
> Flink is dead end code and not supported, right?
>
> No one has said that. Are you claiming that your design is the /only
> possible implementation/ that is capable of achieving the stated goals,
> that there are 0 alternatives? On of the *main**points* of these
> discussion threads is to discover alternative implementations that maybe
> weren't thought of. Yes, that may imply that we amend your design, or
> reject it completely and come up with a new one.
>
>
> Let's clarify what (I think) Till proposed to get the imagination juice
> flowing.
>
> At the end of the day, all we need is a way to provide Flink processes
> with a token that can be periodically updated. _Who_ issues that token
> is irrelevant for the functionality to work. You are proposing for a new
> component in the Flink RM to do that; Till is proposing to have some
> external process do it. *That's it*.
>
> How this could look like in practice is fairly straight forwad; add a
> pluggable interface (aka, your TokenProvider thing) that is loaded in
> each process, which can _somehow_ provide tokens that are then set in
> the UserGroupInformation.
> _How_ the provider receives token is up to the provider. It _may_ just
> talk directly to Kerberos, or it could use some communication channel to
> accept tokens from the outside.
> This would for example make it a lot easier to properly integrate this
> into the lifecycle of the process, as we'd sidestep the whole "TM is
> running but still needs a Token" issue; it could become a proper setup
> step of the process that is independent from other Flink processes.
>
> /Discuss/.
>
> On 03/02/2022 18:57, Gabor Somogyi wrote:
> >> And even
> > if we do it like this, there is no guarantee that it works because there
> > can be other applications bombing the KDC with requests.
> >
> > 1. The main issue to solve here is that workloads using delegation tokens
> > are stopping after 7 days with default configuration.
> > 2. This is not new design, it's rock stable and performing well in Spark
> > for years.
> >
> >>  From a
> > maintainability and separation of concerns perspective I'd rather have
> this
> > as some kind of external tool/service that makes KDC scale better and
> that
> > Flink processes can talk to to obtain the tokens.
> >
> > Ok, so we declare that users who try to use delegation tokens in Flink is
> > dead end code and not supported, right? Then this must be explicitely
> > written in the security documentation that such users who use that
> feature
> > are left behind.
> >
> > As I see the discussion turned away from facts and started to speak about
> > feelings. If you have strategic problems with the feature please put your
> > -1 on the vote and we can spare quite some time.
> >
> > G
> >
> >
> > On Thu, 3 Feb 2022, 18:34 Till Rohrmann,  wrote:
> >
> >> I don't have a good alternative solution but it sounds to me a bit as
> if we
> >> are trying to solve Kerberos' scalability problems within Flink. And
> even
> >> if we

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

First of, at no point have we questioned the use-case and importance of 
this feature, and the fact that David, Till and me spent time looking at 
the FLIP, asking questions, and discussing different aspects of it 
should make this obvious.


I'd appreciate it if you didn't dismiss our replies that quickly.

> Ok, so we declare that users who try to use delegation tokens in 
Flink is dead end code and not supported, right?


No one has said that. Are you claiming that your design is the /only 
possible implementation/ that is capable of achieving the stated goals, 
that there are 0 alternatives? On of the *main**points* of these 
discussion threads is to discover alternative implementations that maybe 
weren't thought of. Yes, that may imply that we amend your design, or 
reject it completely and come up with a new one.



Let's clarify what (I think) Till proposed to get the imagination juice 
flowing.


At the end of the day, all we need is a way to provide Flink processes 
with a token that can be periodically updated. _Who_ issues that token 
is irrelevant for the functionality to work. You are proposing for a new 
component in the Flink RM to do that; Till is proposing to have some 
external process do it. *That's it*.


How this could look like in practice is fairly straight forwad; add a 
pluggable interface (aka, your TokenProvider thing) that is loaded in 
each process, which can _somehow_ provide tokens that are then set in 
the UserGroupInformation.
_How_ the provider receives token is up to the provider. It _may_ just 
talk directly to Kerberos, or it could use some communication channel to 
accept tokens from the outside.
This would for example make it a lot easier to properly integrate this 
into the lifecycle of the process, as we'd sidestep the whole "TM is 
running but still needs a Token" issue; it could become a proper setup 
step of the process that is independent from other Flink processes.


/Discuss/.

On 03/02/2022 18:57, Gabor Somogyi wrote:

And even

if we do it like this, there is no guarantee that it works because there
can be other applications bombing the KDC with requests.

1. The main issue to solve here is that workloads using delegation tokens
are stopping after 7 days with default configuration.
2. This is not new design, it's rock stable and performing well in Spark
for years.


 From a

maintainability and separation of concerns perspective I'd rather have this
as some kind of external tool/service that makes KDC scale better and that
Flink processes can talk to to obtain the tokens.

Ok, so we declare that users who try to use delegation tokens in Flink is
dead end code and not supported, right? Then this must be explicitely
written in the security documentation that such users who use that feature
are left behind.

As I see the discussion turned away from facts and started to speak about
feelings. If you have strategic problems with the feature please put your
-1 on the vote and we can spare quite some time.

G


On Thu, 3 Feb 2022, 18:34 Till Rohrmann,  wrote:


I don't have a good alternative solution but it sounds to me a bit as if we
are trying to solve Kerberos' scalability problems within Flink. And even
if we do it like this, there is no guarantee that it works because there
can be other applications bombing the KDC with requests. From a
maintainability and separation of concerns perspective I'd rather have this
as some kind of external tool/service that makes KDC scale better and that
Flink processes can talk to to obtain the tokens.

Cheers,
Till

On Thu, Feb 3, 2022 at 6:01 PM Gabor Somogyi
wrote:


Oh and the most important reason I've forgotten.
Without the feature in the FLIP all secure workloads with delegation

tokens

are going to stop when tokens are reaching it's max lifetime 
This is around 7 days with default config...

On Thu, Feb 3, 2022 at 5:30 PM Gabor Somogyi
wrote:


That's not the single purpose of the feature but in some environments

it

caused problems.
The main intention is not to deploy keytab to all the nodes because the
attack surface is bigger + reduce the KDC load.
I've already described the situation previously in this thread so

copying

it here.

COPY
"KDC *may* collapse under some circumstances" is the proper wording.

We have several customers who are executing workloads on Spark/Flink.

Most

of the time I'm facing their
daily issues which is heavily environment and use-case dependent. I've
seen various cases:
* where the mentioned ~1k nodes were working fine
* where KDC thought the number of requests are coming from DDOS attack

so

discontinued authentication
* where KDC was simply not responding because of the load
* where KDC was intermittently had some outage (this was the most nasty
thing)

Since you're managing relatively big cluster then you know that KDC is

not

only used by Spark/Flink workloads
but the whole company IT infrastructure is bombing it so it really

depends

on other factors too

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

> And even
if we do it like this, there is no guarantee that it works because there
can be other applications bombing the KDC with requests.

1. The main issue to solve here is that workloads using delegation tokens
are stopping after 7 days with default configuration.
2. This is not new design, it's rock stable and performing well in Spark
for years.

> From a
maintainability and separation of concerns perspective I'd rather have this
as some kind of external tool/service that makes KDC scale better and that
Flink processes can talk to to obtain the tokens.

Ok, so we declare that users who try to use delegation tokens in Flink is
dead end code and not supported, right? Then this must be explicitely
written in the security documentation that such users who use that feature
are left behind.

As I see the discussion turned away from facts and started to speak about
feelings. If you have strategic problems with the feature please put your
-1 on the vote and we can spare quite some time.

G


On Thu, 3 Feb 2022, 18:34 Till Rohrmann,  wrote:

> I don't have a good alternative solution but it sounds to me a bit as if we
> are trying to solve Kerberos' scalability problems within Flink. And even
> if we do it like this, there is no guarantee that it works because there
> can be other applications bombing the KDC with requests. From a
> maintainability and separation of concerns perspective I'd rather have this
> as some kind of external tool/service that makes KDC scale better and that
> Flink processes can talk to to obtain the tokens.
>
> Cheers,
> Till
>
> On Thu, Feb 3, 2022 at 6:01 PM Gabor Somogyi 
> wrote:
>
> > Oh and the most important reason I've forgotten.
> > Without the feature in the FLIP all secure workloads with delegation
> tokens
> > are going to stop when tokens are reaching it's max lifetime 
> > This is around 7 days with default config...
> >
> > On Thu, Feb 3, 2022 at 5:30 PM Gabor Somogyi 
> > wrote:
> >
> > > That's not the single purpose of the feature but in some environments
> it
> > > caused problems.
> > > The main intention is not to deploy keytab to all the nodes because the
> > > attack surface is bigger + reduce the KDC load.
> > > I've already described the situation previously in this thread so
> copying
> > > it here.
> > >
> > > COPY
> > > "KDC *may* collapse under some circumstances" is the proper wording.
> > >
> > > We have several customers who are executing workloads on Spark/Flink.
> > Most
> > > of the time I'm facing their
> > > daily issues which is heavily environment and use-case dependent. I've
> > > seen various cases:
> > > * where the mentioned ~1k nodes were working fine
> > > * where KDC thought the number of requests are coming from DDOS attack
> so
> > > discontinued authentication
> > > * where KDC was simply not responding because of the load
> > > * where KDC was intermittently had some outage (this was the most nasty
> > > thing)
> > >
> > > Since you're managing relatively big cluster then you know that KDC is
> > not
> > > only used by Spark/Flink workloads
> > > but the whole company IT infrastructure is bombing it so it really
> > depends
> > > on other factors too whether KDC is reaching
> > > it's limit or not. Not sure what kind of evidence are you looking for
> but
> > > I'm not authorized to share any information about
> > > our clients data.
> > >
> > > One thing is for sure. The more external system types are used in
> > > workloads (for ex. HDFS, HBase, Hive, Kafka) which
> > > are authenticating through KDC the more possibility to reach this
> > > threshold when the cluster is big enough.
> > > COPY
> > >
> > > > The FLIP mentions scaling issues with 200 nodes; it's really
> surprising
> > > to me that such a small number of requests can already cause issues.
> > >
> > > One node/task doesn't mean 1 request. The following type of kerberos
> auth
> > > types has been seen by me which can run at the same time:
> > > HDFS, Hbase, Hive, Kafka, all DBs (oracle, mariaDB, etc...)
> Additionally
> > > one task is not necessarily opens 1 connection.
> > >
> > > All in all I don't have steps to reproduce but we've faced this
> > already...
> > >
> > > G
> > >
> > >
> > > On Thu, Feb 3, 2022 at 5:15 PM Chesnay Schepler 
> > > wrote:
> > >
> > >> What I don't understand is how this could overload the KDC. Aren't
> > >> tokens valid for a relatively long time period?
> > >>
> > >> For new deployments where many TMs are started at once I could imagine
> > >> it temporarily, but shouldn't the accesses to the KDC eventually
> > >> naturally spread out?
> > >>
> > >> The FLIP mentions scaling issues with 200 nodes; it's really
> surprising
> > >> to me that such a small number of requests can already cause issues.
> > >>
> > >> On 03/02/2022 16:14, Gabor Somogyi wrote:
> > >> >> I would prefer not choosing the first option
> > >> > Then the second option may play only.
> > >> >
> > >> >> I am not a Kerberos expert but is

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-02-03 Thread Gyula Fóra

Hi Till!

The delegation token framework solves a few production problems, KDC
scalability is just one and probably not the most important.

As Gabor has explained some of which are:
 - Solves the problem for token renewal for long running jobs which would
currently time out and die
 - Improves security by not exposing keytabs on each node
 - Reduces KDC load

I do not think we should reject the design just because one of the things
it solves is not primarily Flink's responsibility.
Even if that is the case I think the other issues like security and general
token renewal seem very important to me.

Cheers,
Gyula

On Thu, Feb 3, 2022 at 6:34 PM Till Rohrmann  wrote:

> I don't have a good alternative solution but it sounds to me a bit as if we
> are trying to solve Kerberos' scalability problems within Flink. And even
> if we do it like this, there is no guarantee that it works because there
> can be other applications bombing the KDC with requests. From a
> maintainability and separation of concerns perspective I'd rather have this
> as some kind of external tool/service that makes KDC scale better and that
> Flink processes can talk to to obtain the tokens.
>
> Cheers,
> Till
>
> On Thu, Feb 3, 2022 at 6:01 PM Gabor Somogyi 
> wrote:
>
> > Oh and the most important reason I've forgotten.
> > Without the feature in the FLIP all secure workloads with delegation
> tokens
> > are going to stop when tokens are reaching it's max lifetime 
> > This is around 7 days with default config...
> >
> > On Thu, Feb 3, 2022 at 5:30 PM Gabor Somogyi 
> > wrote:
> >
> > > That's not the single purpose of the feature but in some environments
> it
> > > caused problems.
> > > The main intention is not to deploy keytab to all the nodes because the
> > > attack surface is bigger + reduce the KDC load.
> > > I've already described the situation previously in this thread so
> copying
> > > it here.
> > >
> > > COPY
> > > "KDC *may* collapse under some circumstances" is the proper wording.
> > >
> > > We have several customers who are executing workloads on Spark/Flink.
> > Most
> > > of the time I'm facing their
> > > daily issues which is heavily environment and use-case dependent. I've
> > > seen various cases:
> > > * where the mentioned ~1k nodes were working fine
> > > * where KDC thought the number of requests are coming from DDOS attack
> so
> > > discontinued authentication
> > > * where KDC was simply not responding because of the load
> > > * where KDC was intermittently had some outage (this was the most nasty
> > > thing)
> > >
> > > Since you're managing relatively big cluster then you know that KDC is
> > not
> > > only used by Spark/Flink workloads
> > > but the whole company IT infrastructure is bombing it so it really
> > depends
> > > on other factors too whether KDC is reaching
> > > it's limit or not. Not sure what kind of evidence are you looking for
> but
> > > I'm not authorized to share any information about
> > > our clients data.
> > >
> > > One thing is for sure. The more external system types are used in
> > > workloads (for ex. HDFS, HBase, Hive, Kafka) which
> > > are authenticating through KDC the more possibility to reach this
> > > threshold when the cluster is big enough.
> > > COPY
> > >
> > > > The FLIP mentions scaling issues with 200 nodes; it's really
> surprising
> > > to me that such a small number of requests can already cause issues.
> > >
> > > One node/task doesn't mean 1 request. The following type of kerberos
> auth
> > > types has been seen by me which can run at the same time:
> > > HDFS, Hbase, Hive, Kafka, all DBs (oracle, mariaDB, etc...)
> Additionally
> > > one task is not necessarily opens 1 connection.
> > >
> > > All in all I don't have steps to reproduce but we've faced this
> > already...
> > >
> > > G
> > >
> > >
> > > On Thu, Feb 3, 2022 at 5:15 PM Chesnay Schepler 
> > > wrote:
> > >
> > >> What I don't understand is how this could overload the KDC. Aren't
> > >> tokens valid for a relatively long time period?
> > >>
> > >> For new deployments where many TMs are started at once I could imagine
> > >> it temporarily, but shouldn't the accesses to the KDC eventually
> > >> naturally spread out?
> > >>
> > >> The FLIP mentions scaling issues with 200 nodes; it's really
> surprising
> > >> to me that such a small number of requests can already cause issues.
> > >>
> > >> On 03/02/2022 16:14, Gabor Somogyi wrote:
> > >> >> I would prefer not choosing the first option
> > >> > Then the second option may play only.
> > >> >
> > >> >> I am not a Kerberos expert but is it really so that every
> application
> > >> that
> > >> > wants to use Kerberos needs to implement the token propagation
> itself?
> > >> This
> > >> > somehow feels as if there is something missing.
> > >> >
> > >> > OK, so first some kerberos + token intro.
> > >> >
> > >> > Some basics:
> > >> > * TGT can be created from keytab
> > >> > * TGT is needed to obtain TGS (called

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

I don't have a good alternative solution but it sounds to me a bit as if we
are trying to solve Kerberos' scalability problems within Flink. And even
if we do it like this, there is no guarantee that it works because there
can be other applications bombing the KDC with requests. From a
maintainability and separation of concerns perspective I'd rather have this
as some kind of external tool/service that makes KDC scale better and that
Flink processes can talk to to obtain the tokens.

Cheers,
Till

On Thu, Feb 3, 2022 at 6:01 PM Gabor Somogyi 
wrote:

> Oh and the most important reason I've forgotten.
> Without the feature in the FLIP all secure workloads with delegation tokens
> are going to stop when tokens are reaching it's max lifetime 
> This is around 7 days with default config...
>
> On Thu, Feb 3, 2022 at 5:30 PM Gabor Somogyi 
> wrote:
>
> > That's not the single purpose of the feature but in some environments it
> > caused problems.
> > The main intention is not to deploy keytab to all the nodes because the
> > attack surface is bigger + reduce the KDC load.
> > I've already described the situation previously in this thread so copying
> > it here.
> >
> > COPY
> > "KDC *may* collapse under some circumstances" is the proper wording.
> >
> > We have several customers who are executing workloads on Spark/Flink.
> Most
> > of the time I'm facing their
> > daily issues which is heavily environment and use-case dependent. I've
> > seen various cases:
> > * where the mentioned ~1k nodes were working fine
> > * where KDC thought the number of requests are coming from DDOS attack so
> > discontinued authentication
> > * where KDC was simply not responding because of the load
> > * where KDC was intermittently had some outage (this was the most nasty
> > thing)
> >
> > Since you're managing relatively big cluster then you know that KDC is
> not
> > only used by Spark/Flink workloads
> > but the whole company IT infrastructure is bombing it so it really
> depends
> > on other factors too whether KDC is reaching
> > it's limit or not. Not sure what kind of evidence are you looking for but
> > I'm not authorized to share any information about
> > our clients data.
> >
> > One thing is for sure. The more external system types are used in
> > workloads (for ex. HDFS, HBase, Hive, Kafka) which
> > are authenticating through KDC the more possibility to reach this
> > threshold when the cluster is big enough.
> > COPY
> >
> > > The FLIP mentions scaling issues with 200 nodes; it's really surprising
> > to me that such a small number of requests can already cause issues.
> >
> > One node/task doesn't mean 1 request. The following type of kerberos auth
> > types has been seen by me which can run at the same time:
> > HDFS, Hbase, Hive, Kafka, all DBs (oracle, mariaDB, etc...) Additionally
> > one task is not necessarily opens 1 connection.
> >
> > All in all I don't have steps to reproduce but we've faced this
> already...
> >
> > G
> >
> >
> > On Thu, Feb 3, 2022 at 5:15 PM Chesnay Schepler 
> > wrote:
> >
> >> What I don't understand is how this could overload the KDC. Aren't
> >> tokens valid for a relatively long time period?
> >>
> >> For new deployments where many TMs are started at once I could imagine
> >> it temporarily, but shouldn't the accesses to the KDC eventually
> >> naturally spread out?
> >>
> >> The FLIP mentions scaling issues with 200 nodes; it's really surprising
> >> to me that such a small number of requests can already cause issues.
> >>
> >> On 03/02/2022 16:14, Gabor Somogyi wrote:
> >> >> I would prefer not choosing the first option
> >> > Then the second option may play only.
> >> >
> >> >> I am not a Kerberos expert but is it really so that every application
> >> that
> >> > wants to use Kerberos needs to implement the token propagation itself?
> >> This
> >> > somehow feels as if there is something missing.
> >> >
> >> > OK, so first some kerberos + token intro.
> >> >
> >> > Some basics:
> >> > * TGT can be created from keytab
> >> > * TGT is needed to obtain TGS (called token)
> >> > * Authentication only works with TGS -> all places where external
> >> system is
> >> > needed either a TGT or TGS needed
> >> >
> >> > There are basically 2 ways to authenticate to a kerberos secured
> >> external
> >> > system:
> >> > 1. One needs a kerberos TGT which MUST be propagated to all JVMs. Here
> >> each
> >> > and every JVM obtains a TGS by itself which bombs the KDC that may
> >> collapse.
> >> > 2. One needs a kerberos TGT which exists only on a single place (in
> this
> >> > case JM). JM gets a TGS which MUST be propagated to all TMs because
> >> > otherwise authentication fails.
> >> >
> >> > Now the whole system works in a way that keytab file (we can imagine
> >> that
> >> > as plaintext password) is reachable on all nodes.
> >> > This is a relatively huge attack surface. Now the main intention is:
> >> > * Instead of propagating keytab file to all

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

Oh and the most important reason I've forgotten.
Without the feature in the FLIP all secure workloads with delegation tokens
are going to stop when tokens are reaching it's max lifetime 
This is around 7 days with default config...

On Thu, Feb 3, 2022 at 5:30 PM Gabor Somogyi 
wrote:

> That's not the single purpose of the feature but in some environments it
> caused problems.
> The main intention is not to deploy keytab to all the nodes because the
> attack surface is bigger + reduce the KDC load.
> I've already described the situation previously in this thread so copying
> it here.
>
> COPY
> "KDC *may* collapse under some circumstances" is the proper wording.
>
> We have several customers who are executing workloads on Spark/Flink. Most
> of the time I'm facing their
> daily issues which is heavily environment and use-case dependent. I've
> seen various cases:
> * where the mentioned ~1k nodes were working fine
> * where KDC thought the number of requests are coming from DDOS attack so
> discontinued authentication
> * where KDC was simply not responding because of the load
> * where KDC was intermittently had some outage (this was the most nasty
> thing)
>
> Since you're managing relatively big cluster then you know that KDC is not
> only used by Spark/Flink workloads
> but the whole company IT infrastructure is bombing it so it really depends
> on other factors too whether KDC is reaching
> it's limit or not. Not sure what kind of evidence are you looking for but
> I'm not authorized to share any information about
> our clients data.
>
> One thing is for sure. The more external system types are used in
> workloads (for ex. HDFS, HBase, Hive, Kafka) which
> are authenticating through KDC the more possibility to reach this
> threshold when the cluster is big enough.
> COPY
>
> > The FLIP mentions scaling issues with 200 nodes; it's really surprising
> to me that such a small number of requests can already cause issues.
>
> One node/task doesn't mean 1 request. The following type of kerberos auth
> types has been seen by me which can run at the same time:
> HDFS, Hbase, Hive, Kafka, all DBs (oracle, mariaDB, etc...) Additionally
> one task is not necessarily opens 1 connection.
>
> All in all I don't have steps to reproduce but we've faced this already...
>
> G
>
>
> On Thu, Feb 3, 2022 at 5:15 PM Chesnay Schepler 
> wrote:
>
>> What I don't understand is how this could overload the KDC. Aren't
>> tokens valid for a relatively long time period?
>>
>> For new deployments where many TMs are started at once I could imagine
>> it temporarily, but shouldn't the accesses to the KDC eventually
>> naturally spread out?
>>
>> The FLIP mentions scaling issues with 200 nodes; it's really surprising
>> to me that such a small number of requests can already cause issues.
>>
>> On 03/02/2022 16:14, Gabor Somogyi wrote:
>> >> I would prefer not choosing the first option
>> > Then the second option may play only.
>> >
>> >> I am not a Kerberos expert but is it really so that every application
>> that
>> > wants to use Kerberos needs to implement the token propagation itself?
>> This
>> > somehow feels as if there is something missing.
>> >
>> > OK, so first some kerberos + token intro.
>> >
>> > Some basics:
>> > * TGT can be created from keytab
>> > * TGT is needed to obtain TGS (called token)
>> > * Authentication only works with TGS -> all places where external
>> system is
>> > needed either a TGT or TGS needed
>> >
>> > There are basically 2 ways to authenticate to a kerberos secured
>> external
>> > system:
>> > 1. One needs a kerberos TGT which MUST be propagated to all JVMs. Here
>> each
>> > and every JVM obtains a TGS by itself which bombs the KDC that may
>> collapse.
>> > 2. One needs a kerberos TGT which exists only on a single place (in this
>> > case JM). JM gets a TGS which MUST be propagated to all TMs because
>> > otherwise authentication fails.
>> >
>> > Now the whole system works in a way that keytab file (we can imagine
>> that
>> > as plaintext password) is reachable on all nodes.
>> > This is a relatively huge attack surface. Now the main intention is:
>> > * Instead of propagating keytab file to all nodes propagate a TGS which
>> has
>> > limited lifetime (more secure)
>> > * Do the TGS generation in a single place so KDC may not collapse +
>> having
>> > keytab only on a single node can be better protected
>> >
>> > As a final conclusion if there is a place which expects to do kerberos
>> > authentication then it's a MUST to have either TGT or TGS.
>> > Now it's done in a pretty unsecure way. The questions are the following:
>> > * Do we want to leave this unsecure keytab propagation like this and
>> bomb
>> > KDC?
>> > * If no then how do we propagate the more secure token to TMs.
>> >
>> > If the answer to the first question is no then the FLIP can be abandoned
>> > and doesn't worth the further effort.
>> > If the answer is yes then we can talk about the how

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

That's not the single purpose of the feature but in some environments it
caused problems.
The main intention is not to deploy keytab to all the nodes because the
attack surface is bigger + reduce the KDC load.
I've already described the situation previously in this thread so copying
it here.

COPY
"KDC *may* collapse under some circumstances" is the proper wording.

We have several customers who are executing workloads on Spark/Flink. Most
of the time I'm facing their
daily issues which is heavily environment and use-case dependent. I've seen
various cases:
* where the mentioned ~1k nodes were working fine
* where KDC thought the number of requests are coming from DDOS attack so
discontinued authentication
* where KDC was simply not responding because of the load
* where KDC was intermittently had some outage (this was the most nasty
thing)

Since you're managing relatively big cluster then you know that KDC is not
only used by Spark/Flink workloads
but the whole company IT infrastructure is bombing it so it really depends
on other factors too whether KDC is reaching
it's limit or not. Not sure what kind of evidence are you looking for but
I'm not authorized to share any information about
our clients data.

One thing is for sure. The more external system types are used in workloads
(for ex. HDFS, HBase, Hive, Kafka) which
are authenticating through KDC the more possibility to reach this threshold
when the cluster is big enough.
COPY

> The FLIP mentions scaling issues with 200 nodes; it's really surprising
to me that such a small number of requests can already cause issues.

One node/task doesn't mean 1 request. The following type of kerberos auth
types has been seen by me which can run at the same time:
HDFS, Hbase, Hive, Kafka, all DBs (oracle, mariaDB, etc...) Additionally
one task is not necessarily opens 1 connection.

All in all I don't have steps to reproduce but we've faced this already...

G

On Thu, Feb 3, 2022 at 5:15 PM Chesnay Schepler  wrote:

> What I don't understand is how this could overload the KDC. Aren't
> tokens valid for a relatively long time period?
>
> For new deployments where many TMs are started at once I could imagine
> it temporarily, but shouldn't the accesses to the KDC eventually
> naturally spread out?
>
> The FLIP mentions scaling issues with 200 nodes; it's really surprising
> to me that such a small number of requests can already cause issues.
>
> On 03/02/2022 16:14, Gabor Somogyi wrote:
> >> I would prefer not choosing the first option
> > Then the second option may play only.
> >
> >> I am not a Kerberos expert but is it really so that every application
> that
> > wants to use Kerberos needs to implement the token propagation itself?
> This
> > somehow feels as if there is something missing.
> >
> > OK, so first some kerberos + token intro.
> >
> > Some basics:
> > * TGT can be created from keytab
> > * TGT is needed to obtain TGS (called token)
> > * Authentication only works with TGS -> all places where external system
> is
> > needed either a TGT or TGS needed
> >
> > There are basically 2 ways to authenticate to a kerberos secured external
> > system:
> > 1. One needs a kerberos TGT which MUST be propagated to all JVMs. Here
> each
> > and every JVM obtains a TGS by itself which bombs the KDC that may
> collapse.
> > 2. One needs a kerberos TGT which exists only on a single place (in this
> > case JM). JM gets a TGS which MUST be propagated to all TMs because
> > otherwise authentication fails.
> >
> > Now the whole system works in a way that keytab file (we can imagine that
> > as plaintext password) is reachable on all nodes.
> > This is a relatively huge attack surface. Now the main intention is:
> > * Instead of propagating keytab file to all nodes propagate a TGS which
> has
> > limited lifetime (more secure)
> > * Do the TGS generation in a single place so KDC may not collapse +
> having
> > keytab only on a single node can be better protected
> >
> > As a final conclusion if there is a place which expects to do kerberos
> > authentication then it's a MUST to have either TGT or TGS.
> > Now it's done in a pretty unsecure way. The questions are the following:
> > * Do we want to leave this unsecure keytab propagation like this and bomb
> > KDC?
> > * If no then how do we propagate the more secure token to TMs.
> >
> > If the answer to the first question is no then the FLIP can be abandoned
> > and doesn't worth the further effort.
> > If the answer is yes then we can talk about the how part.
> >
> > G
> >
> >
> > On Thu, Feb 3, 2022 at 3:42 PM Till Rohrmann 
> wrote:
> >
> >> I would prefer not choosing the first option
> >>
> >>> Make the TM accept tasks only after registration(not sure if it's
> >> possible or makes sense at all)
> >>
> >> because it effectively means that we change how Flink's component
> lifecycle
> >> works for distributing Kerberos tokens. It also effectively means that
> a TM
> >> cannot make

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

What I don't understand is how this could overload the KDC. Aren't 
tokens valid for a relatively long time period?


For new deployments where many TMs are started at once I could imagine 
it temporarily, but shouldn't the accesses to the KDC eventually 
naturally spread out?


The FLIP mentions scaling issues with 200 nodes; it's really surprising 
to me that such a small number of requests can already cause issues.


On 03/02/2022 16:14, Gabor Somogyi wrote:

I would prefer not choosing the first option

Then the second option may play only.


I am not a Kerberos expert but is it really so that every application that

wants to use Kerberos needs to implement the token propagation itself? This
somehow feels as if there is something missing.

OK, so first some kerberos + token intro.

Some basics:
* TGT can be created from keytab
* TGT is needed to obtain TGS (called token)
* Authentication only works with TGS -> all places where external system is
needed either a TGT or TGS needed

There are basically 2 ways to authenticate to a kerberos secured external
system:
1. One needs a kerberos TGT which MUST be propagated to all JVMs. Here each
and every JVM obtains a TGS by itself which bombs the KDC that may collapse.
2. One needs a kerberos TGT which exists only on a single place (in this
case JM). JM gets a TGS which MUST be propagated to all TMs because
otherwise authentication fails.

Now the whole system works in a way that keytab file (we can imagine that
as plaintext password) is reachable on all nodes.
This is a relatively huge attack surface. Now the main intention is:
* Instead of propagating keytab file to all nodes propagate a TGS which has
limited lifetime (more secure)
* Do the TGS generation in a single place so KDC may not collapse + having
keytab only on a single node can be better protected

As a final conclusion if there is a place which expects to do kerberos
authentication then it's a MUST to have either TGT or TGS.
Now it's done in a pretty unsecure way. The questions are the following:
* Do we want to leave this unsecure keytab propagation like this and bomb
KDC?
* If no then how do we propagate the more secure token to TMs.

If the answer to the first question is no then the FLIP can be abandoned
and doesn't worth the further effort.
If the answer is yes then we can talk about the how part.

G


On Thu, Feb 3, 2022 at 3:42 PM Till Rohrmann  wrote:


I would prefer not choosing the first option


Make the TM accept tasks only after registration(not sure if it's

possible or makes sense at all)

because it effectively means that we change how Flink's component lifecycle
works for distributing Kerberos tokens. It also effectively means that a TM
cannot make progress until connected to a RM.

I am not a Kerberos expert but is it really so that every application that
wants to use Kerberos needs to implement the token propagation itself? This
somehow feels as if there is something missing.

Cheers,
Till

On Thu, Feb 3, 2022 at 3:29 PM Gabor Somogyi 
wrote:


  Isn't this something the underlying resource management system could

do

or which every process could do on its own?

I was looking for such feature but not found.
Maybe we can solve the propagation easier but then I'm waiting on better
suggestion.
If anybody has better/more simple idea then please point to a specific
feature which works on all resource management systems.


Here's an example for the TM to run workloads without being connected

to the RM, without ever having a valid token

All in all I see the main problem. Not sure what is the reason behind

that

a TM accepts tasks w/o registration but clearly not helping here.
I basically see 2 possible solutions:
* Make the TM accept tasks only after registration(not sure if it's
possible or makes sense at all)
* We send tokens right after container creation with
"updateDelegationTokens"
Not sure which one is more realistic to do since I'm not involved the new
feature.
WDYT?


On Thu, Feb 3, 2022 at 3:09 PM Till Rohrmann 

wrote:

Hi everyone,

Sorry for joining this discussion late. I also did not read all

responses

in this thread so my question might already be answered: Why does Flink
need to be involved in the propagation of the tokens? Why do we need
explicit RPC calls in the Flink domain? Isn't this something the

underlying

resource management system could do or which every process could do on

its

own? I am a bit worried that we are making Flink responsible for

something

that it is not really designed to do so.

Cheers,
Till

On Thu, Feb 3, 2022 at 2:54 PM Chesnay Schepler 
wrote:


Here's an example for the TM to run workloads without being connected

to

the RM, while potentially having a valid token:

  1. TM registers at RM
  2. JobMaster requests slot from RM -> TM gets notified
  3. JM fails over
  4. TM re-offers the slot to the failed over JobMaster
  5. TM reconnects to RM at some point

Here's an example for the TM to run workloads without being connected

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

> I would prefer not choosing the first option

Then the second option may play only.

> I am not a Kerberos expert but is it really so that every application that
wants to use Kerberos needs to implement the token propagation itself? This
somehow feels as if there is something missing.

OK, so first some kerberos + token intro.

Some basics:
* TGT can be created from keytab
* TGT is needed to obtain TGS (called token)
* Authentication only works with TGS -> all places where external system is
needed either a TGT or TGS needed

There are basically 2 ways to authenticate to a kerberos secured external
system:
1. One needs a kerberos TGT which MUST be propagated to all JVMs. Here each
and every JVM obtains a TGS by itself which bombs the KDC that may collapse.
2. One needs a kerberos TGT which exists only on a single place (in this
case JM). JM gets a TGS which MUST be propagated to all TMs because
otherwise authentication fails.

Now the whole system works in a way that keytab file (we can imagine that
as plaintext password) is reachable on all nodes.
This is a relatively huge attack surface. Now the main intention is:
* Instead of propagating keytab file to all nodes propagate a TGS which has
limited lifetime (more secure)
* Do the TGS generation in a single place so KDC may not collapse + having
keytab only on a single node can be better protected

As a final conclusion if there is a place which expects to do kerberos
authentication then it's a MUST to have either TGT or TGS.
Now it's done in a pretty unsecure way. The questions are the following:
* Do we want to leave this unsecure keytab propagation like this and bomb
KDC?
* If no then how do we propagate the more secure token to TMs.

If the answer to the first question is no then the FLIP can be abandoned
and doesn't worth the further effort.
If the answer is yes then we can talk about the how part.

G


On Thu, Feb 3, 2022 at 3:42 PM Till Rohrmann  wrote:

> I would prefer not choosing the first option
>
> > Make the TM accept tasks only after registration(not sure if it's
> possible or makes sense at all)
>
> because it effectively means that we change how Flink's component lifecycle
> works for distributing Kerberos tokens. It also effectively means that a TM
> cannot make progress until connected to a RM.
>
> I am not a Kerberos expert but is it really so that every application that
> wants to use Kerberos needs to implement the token propagation itself? This
> somehow feels as if there is something missing.
>
> Cheers,
> Till
>
> On Thu, Feb 3, 2022 at 3:29 PM Gabor Somogyi 
> wrote:
>
> > >  Isn't this something the underlying resource management system could
> do
> > or which every process could do on its own?
> >
> > I was looking for such feature but not found.
> > Maybe we can solve the propagation easier but then I'm waiting on better
> > suggestion.
> > If anybody has better/more simple idea then please point to a specific
> > feature which works on all resource management systems.
> >
> > > Here's an example for the TM to run workloads without being connected
> > to the RM, without ever having a valid token
> >
> > All in all I see the main problem. Not sure what is the reason behind
> that
> > a TM accepts tasks w/o registration but clearly not helping here.
> > I basically see 2 possible solutions:
> > * Make the TM accept tasks only after registration(not sure if it's
> > possible or makes sense at all)
> > * We send tokens right after container creation with
> > "updateDelegationTokens"
> > Not sure which one is more realistic to do since I'm not involved the new
> > feature.
> > WDYT?
> >
> >
> > On Thu, Feb 3, 2022 at 3:09 PM Till Rohrmann 
> wrote:
> >
> >> Hi everyone,
> >>
> >> Sorry for joining this discussion late. I also did not read all
> responses
> >> in this thread so my question might already be answered: Why does Flink
> >> need to be involved in the propagation of the tokens? Why do we need
> >> explicit RPC calls in the Flink domain? Isn't this something the
> underlying
> >> resource management system could do or which every process could do on
> its
> >> own? I am a bit worried that we are making Flink responsible for
> something
> >> that it is not really designed to do so.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Thu, Feb 3, 2022 at 2:54 PM Chesnay Schepler 
> >> wrote:
> >>
> >>> Here's an example for the TM to run workloads without being connected
> to
> >>> the RM, while potentially having a valid token:
> >>>
> >>>  1. TM registers at RM
> >>>  2. JobMaster requests slot from RM -> TM gets notified
> >>>  3. JM fails over
> >>>  4. TM re-offers the slot to the failed over JobMaster
> >>>  5. TM reconnects to RM at some point
> >>>
> >>> Here's an example for the TM to run workloads without being connected
> to
> >>> the RM, without ever having a valid token:
> >>>
> >>>  1. TM1 has a valid token and is running some tasks.
> >>>  2. TM1 crashes
> >>>  3. TM2 is started to take over, and re-uses the working

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

I would prefer not choosing the first option

> Make the TM accept tasks only after registration(not sure if it's
possible or makes sense at all)

because it effectively means that we change how Flink's component lifecycle
works for distributing Kerberos tokens. It also effectively means that a TM
cannot make progress until connected to a RM.

I am not a Kerberos expert but is it really so that every application that
wants to use Kerberos needs to implement the token propagation itself? This
somehow feels as if there is something missing.

Cheers,
Till

On Thu, Feb 3, 2022 at 3:29 PM Gabor Somogyi 
wrote:

> >  Isn't this something the underlying resource management system could do
> or which every process could do on its own?
>
> I was looking for such feature but not found.
> Maybe we can solve the propagation easier but then I'm waiting on better
> suggestion.
> If anybody has better/more simple idea then please point to a specific
> feature which works on all resource management systems.
>
> > Here's an example for the TM to run workloads without being connected
> to the RM, without ever having a valid token
>
> All in all I see the main problem. Not sure what is the reason behind that
> a TM accepts tasks w/o registration but clearly not helping here.
> I basically see 2 possible solutions:
> * Make the TM accept tasks only after registration(not sure if it's
> possible or makes sense at all)
> * We send tokens right after container creation with
> "updateDelegationTokens"
> Not sure which one is more realistic to do since I'm not involved the new
> feature.
> WDYT?
>
>
> On Thu, Feb 3, 2022 at 3:09 PM Till Rohrmann  wrote:
>
>> Hi everyone,
>>
>> Sorry for joining this discussion late. I also did not read all responses
>> in this thread so my question might already be answered: Why does Flink
>> need to be involved in the propagation of the tokens? Why do we need
>> explicit RPC calls in the Flink domain? Isn't this something the underlying
>> resource management system could do or which every process could do on its
>> own? I am a bit worried that we are making Flink responsible for something
>> that it is not really designed to do so.
>>
>> Cheers,
>> Till
>>
>> On Thu, Feb 3, 2022 at 2:54 PM Chesnay Schepler 
>> wrote:
>>
>>> Here's an example for the TM to run workloads without being connected to
>>> the RM, while potentially having a valid token:
>>>
>>>  1. TM registers at RM
>>>  2. JobMaster requests slot from RM -> TM gets notified
>>>  3. JM fails over
>>>  4. TM re-offers the slot to the failed over JobMaster
>>>  5. TM reconnects to RM at some point
>>>
>>> Here's an example for the TM to run workloads without being connected to
>>> the RM, without ever having a valid token:
>>>
>>>  1. TM1 has a valid token and is running some tasks.
>>>  2. TM1 crashes
>>>  3. TM2 is started to take over, and re-uses the working directory of
>>> TM1 (new feature in 1.15!)
>>>  4. TM2 recovers the previous slot allocations
>>>  5. TM2 is informed about leading JM
>>>  6. TM2 starts registration with RM
>>>  7. TM2 offers slots to JobMaster
>>>  8. TM2 accepts task submission from JobMaster
>>>  9. ...some time later the registration completes...
>>>
>>>
>>> On 03/02/2022 14:24, Gabor Somogyi wrote:
>>> > > but it can happen that the JobMaster+TM collaborate to run stuff
>>> > without the TM being registered at the RM
>>> >
>>> > Honestly I'm not educated enough within Flink to give an example to
>>> > such scenario.
>>> > Until now I thought JM defines tasks to be done and TM just blindly
>>> > connects to external systems and does the processing.
>>> > All in all if external systems can be touched when JM + TM
>>> > collaboration happens then we need to consider that in the design.
>>> > Since I don't have an example scenario I don't know what exactly needs
>>> > to be solved.
>>> > I think we need an example case to decide whether we face a real issue
>>> > or the design is not leaking.
>>> >
>>> >
>>> > On Thu, Feb 3, 2022 at 2:12 PM Chesnay Schepler 
>>> > wrote:
>>> >
>>> > > Just to learn something new. I think local recovery is clear to
>>> > me which is not touching external systems like Kafka or so
>>> > (correct me if I'm wrong). Is it possible that such case the user
>>> > code just starts to run blindly w/o JM coordination and connects
>>> > to external systems to do data processing?
>>> >
>>> > Local recovery itself shouldn't touch external systems; the TM
>>> > cannot just run user-code without the JobMaster being involved,
>>> > but it can happen that the JobMaster+TM collaborate to run stuff
>>> > without the TM being registered at the RM.
>>> >
>>> > On 03/02/2022 13:48, Gabor Somogyi wrote:
>>> >> > Any error in loading the provider (be it by accident or
>>> >> explicit checks) then is a setup error and we can fail the
>>> cluster.
>>> >>
>>> >> Fail fast is a good direction in my view. In Spark I wanted to go
>>> >> to this

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

>  Isn't this something the underlying resource management system could do
or which every process could do on its own?

I was looking for such feature but not found.
Maybe we can solve the propagation easier but then I'm waiting on better
suggestion.
If anybody has better/more simple idea then please point to a specific
feature which works on all resource management systems.

> Here's an example for the TM to run workloads without being connected to the
RM, without ever having a valid token

All in all I see the main problem. Not sure what is the reason behind that
a TM accepts tasks w/o registration but clearly not helping here.
I basically see 2 possible solutions:
* Make the TM accept tasks only after registration(not sure if it's
possible or makes sense at all)
* We send tokens right after container creation with
"updateDelegationTokens"
Not sure which one is more realistic to do since I'm not involved the new
feature.
WDYT?


On Thu, Feb 3, 2022 at 3:09 PM Till Rohrmann  wrote:

> Hi everyone,
>
> Sorry for joining this discussion late. I also did not read all responses
> in this thread so my question might already be answered: Why does Flink
> need to be involved in the propagation of the tokens? Why do we need
> explicit RPC calls in the Flink domain? Isn't this something the underlying
> resource management system could do or which every process could do on its
> own? I am a bit worried that we are making Flink responsible for something
> that it is not really designed to do so.
>
> Cheers,
> Till
>
> On Thu, Feb 3, 2022 at 2:54 PM Chesnay Schepler 
> wrote:
>
>> Here's an example for the TM to run workloads without being connected to
>> the RM, while potentially having a valid token:
>>
>>  1. TM registers at RM
>>  2. JobMaster requests slot from RM -> TM gets notified
>>  3. JM fails over
>>  4. TM re-offers the slot to the failed over JobMaster
>>  5. TM reconnects to RM at some point
>>
>> Here's an example for the TM to run workloads without being connected to
>> the RM, without ever having a valid token:
>>
>>  1. TM1 has a valid token and is running some tasks.
>>  2. TM1 crashes
>>  3. TM2 is started to take over, and re-uses the working directory of
>> TM1 (new feature in 1.15!)
>>  4. TM2 recovers the previous slot allocations
>>  5. TM2 is informed about leading JM
>>  6. TM2 starts registration with RM
>>  7. TM2 offers slots to JobMaster
>>  8. TM2 accepts task submission from JobMaster
>>  9. ...some time later the registration completes...
>>
>>
>> On 03/02/2022 14:24, Gabor Somogyi wrote:
>> > > but it can happen that the JobMaster+TM collaborate to run stuff
>> > without the TM being registered at the RM
>> >
>> > Honestly I'm not educated enough within Flink to give an example to
>> > such scenario.
>> > Until now I thought JM defines tasks to be done and TM just blindly
>> > connects to external systems and does the processing.
>> > All in all if external systems can be touched when JM + TM
>> > collaboration happens then we need to consider that in the design.
>> > Since I don't have an example scenario I don't know what exactly needs
>> > to be solved.
>> > I think we need an example case to decide whether we face a real issue
>> > or the design is not leaking.
>> >
>> >
>> > On Thu, Feb 3, 2022 at 2:12 PM Chesnay Schepler 
>> > wrote:
>> >
>> > > Just to learn something new. I think local recovery is clear to
>> > me which is not touching external systems like Kafka or so
>> > (correct me if I'm wrong). Is it possible that such case the user
>> > code just starts to run blindly w/o JM coordination and connects
>> > to external systems to do data processing?
>> >
>> > Local recovery itself shouldn't touch external systems; the TM
>> > cannot just run user-code without the JobMaster being involved,
>> > but it can happen that the JobMaster+TM collaborate to run stuff
>> > without the TM being registered at the RM.
>> >
>> > On 03/02/2022 13:48, Gabor Somogyi wrote:
>> >> > Any error in loading the provider (be it by accident or
>> >> explicit checks) then is a setup error and we can fail the cluster.
>> >>
>> >> Fail fast is a good direction in my view. In Spark I wanted to go
>> >> to this direction but there were other opinions so there if a
>> >> provider is not loaded then the workload goes further.
>> >> Of course the processing will fail if the token is missing...
>> >>
>> >> > Requiring HBase (and Hadoop for that matter) to be on the JM
>> >> system classpath would be a bit unfortunate. Have you considered
>> >> loading the providers as plugins?
>> >>
>> >> Even if it's unfortunate the actual implementation is depending
>> >> on that already. Moving HBase and/or all token providers into
>> >> plugins is a possibility.
>> >> That way if one wants to use a specific provider then a plugin
>> >> need to be added. If we would like to go to this direction I
>> >> would do

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

Hi everyone,

Sorry for joining this discussion late. I also did not read all responses
in this thread so my question might already be answered: Why does Flink
need to be involved in the propagation of the tokens? Why do we need
explicit RPC calls in the Flink domain? Isn't this something the underlying
resource management system could do or which every process could do on its
own? I am a bit worried that we are making Flink responsible for something
that it is not really designed to do so.

Cheers,
Till

On Thu, Feb 3, 2022 at 2:54 PM Chesnay Schepler  wrote:

> Here's an example for the TM to run workloads without being connected to
> the RM, while potentially having a valid token:
>
>  1. TM registers at RM
>  2. JobMaster requests slot from RM -> TM gets notified
>  3. JM fails over
>  4. TM re-offers the slot to the failed over JobMaster
>  5. TM reconnects to RM at some point
>
> Here's an example for the TM to run workloads without being connected to
> the RM, without ever having a valid token:
>
>  1. TM1 has a valid token and is running some tasks.
>  2. TM1 crashes
>  3. TM2 is started to take over, and re-uses the working directory of
> TM1 (new feature in 1.15!)
>  4. TM2 recovers the previous slot allocations
>  5. TM2 is informed about leading JM
>  6. TM2 starts registration with RM
>  7. TM2 offers slots to JobMaster
>  8. TM2 accepts task submission from JobMaster
>  9. ...some time later the registration completes...
>
>
> On 03/02/2022 14:24, Gabor Somogyi wrote:
> > > but it can happen that the JobMaster+TM collaborate to run stuff
> > without the TM being registered at the RM
> >
> > Honestly I'm not educated enough within Flink to give an example to
> > such scenario.
> > Until now I thought JM defines tasks to be done and TM just blindly
> > connects to external systems and does the processing.
> > All in all if external systems can be touched when JM + TM
> > collaboration happens then we need to consider that in the design.
> > Since I don't have an example scenario I don't know what exactly needs
> > to be solved.
> > I think we need an example case to decide whether we face a real issue
> > or the design is not leaking.
> >
> >
> > On Thu, Feb 3, 2022 at 2:12 PM Chesnay Schepler 
> > wrote:
> >
> > > Just to learn something new. I think local recovery is clear to
> > me which is not touching external systems like Kafka or so
> > (correct me if I'm wrong). Is it possible that such case the user
> > code just starts to run blindly w/o JM coordination and connects
> > to external systems to do data processing?
> >
> > Local recovery itself shouldn't touch external systems; the TM
> > cannot just run user-code without the JobMaster being involved,
> > but it can happen that the JobMaster+TM collaborate to run stuff
> > without the TM being registered at the RM.
> >
> > On 03/02/2022 13:48, Gabor Somogyi wrote:
> >> > Any error in loading the provider (be it by accident or
> >> explicit checks) then is a setup error and we can fail the cluster.
> >>
> >> Fail fast is a good direction in my view. In Spark I wanted to go
> >> to this direction but there were other opinions so there if a
> >> provider is not loaded then the workload goes further.
> >> Of course the processing will fail if the token is missing...
> >>
> >> > Requiring HBase (and Hadoop for that matter) to be on the JM
> >> system classpath would be a bit unfortunate. Have you considered
> >> loading the providers as plugins?
> >>
> >> Even if it's unfortunate the actual implementation is depending
> >> on that already. Moving HBase and/or all token providers into
> >> plugins is a possibility.
> >> That way if one wants to use a specific provider then a plugin
> >> need to be added. If we would like to go to this direction I
> >> would do that in a separate
> >> FLIP not to have feature creep here. The actual FLIP already
> >> covers several thousand lines of code changes.
> >>
> >> > This is missing from the FLIP. From my experience with the
> >> metric reporters, having the implementation rely on the
> >> configuration is really annoying for testing purposes. That's why
> >> I suggested factories; they can take care of extracting all
> >> parameters that the implementation needs, and then pass them
> >> nicely via the constructor.
> >>
> >> ServiceLoader provided services must have a norarg constructor
> >> where no parameters can be passed.
> >> As a side note testing delegation token providers is pain in the
> >> ass and not possible with automated tests without creating a
> >> fully featured kerberos cluster with KDC, HDFS, HBase, Kafka, etc..
> >> We've had several tries in Spark but then gave it up because of
> >> the complexity and the flakyness of it so I wouldn't care much
> >> about unit testing.
> >> The sad truth is that most of the token

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

Here's an example for the TM to run workloads without being connected to 
the RM, while potentially having a valid token:

1. TM registers at RM
2. JobMaster requests slot from RM -> TM gets notified
3. JM fails over
4. TM re-offers the slot to the failed over JobMaster
5. TM reconnects to RM at some point

Here's an example for the TM to run workloads without being connected to 
the RM, without ever having a valid token:

1. TM1 has a valid token and is running some tasks.
2. TM1 crashes
3. TM2 is started to take over, and re-uses the working directory of
   TM1 (new feature in 1.15!)
4. TM2 recovers the previous slot allocations
5. TM2 is informed about leading JM
6. TM2 starts registration with RM
7. TM2 offers slots to JobMaster
8. TM2 accepts task submission from JobMaster
9. ...some time later the registration completes...

On 03/02/2022 14:24, Gabor Somogyi wrote:
> but it can happen that the JobMaster+TM collaborate to run stuff 
without the TM being registered at the RM

Honestly I'm not educated enough within Flink to give an example to 
such scenario.
Until now I thought JM defines tasks to be done and TM just blindly 
connects to external systems and does the processing.
All in all if external systems can be touched when JM + TM 
collaboration happens then we need to consider that in the design.
Since I don't have an example scenario I don't know what exactly needs 
to be solved.
I think we need an example case to decide whether we face a real issue 
or the design is not leaking.

On Thu, Feb 3, 2022 at 2:12 PM Chesnay Schepler  
wrote:

> Just to learn something new. I think local recovery is clear to
me which is not touching external systems like Kafka or so
(correct me if I'm wrong). Is it possible that such case the user
code just starts to run blindly w/o JM coordination and connects
to external systems to do data processing?

Local recovery itself shouldn't touch external systems; the TM
cannot just run user-code without the JobMaster being involved,
but it can happen that the JobMaster+TM collaborate to run stuff
without the TM being registered at the RM.

On 03/02/2022 13:48, Gabor Somogyi wrote:

> Any error in loading the provider (be it by accident or
explicit checks) then is a setup error and we can fail the cluster.

Fail fast is a good direction in my view. In Spark I wanted to go
to this direction but there were other opinions so there if a
provider is not loaded then the workload goes further.
Of course the processing will fail if the token is missing...

> Requiring HBase (and Hadoop for that matter) to be on the JM
system classpath would be a bit unfortunate. Have you considered
loading the providers as plugins?

Even if it's unfortunate the actual implementation is depending
on that already. Moving HBase and/or all token providers into
plugins is a possibility.
That way if one wants to use a specific provider then a plugin
need to be added. If we would like to go to this direction I
would do that in a separate
FLIP not to have feature creep here. The actual FLIP already
covers several thousand lines of code changes.

> This is missing from the FLIP. From my experience with the
metric reporters, having the implementation rely on the
configuration is really annoying for testing purposes. That's why
I suggested factories; they can take care of extracting all
parameters that the implementation needs, and then pass them
nicely via the constructor.

ServiceLoader provided services must have a norarg constructor
where no parameters can be passed.
As a side note testing delegation token providers is pain in the
ass and not possible with automated tests without creating a
fully featured kerberos cluster with KDC, HDFS, HBase, Kafka, etc..
We've had several tries in Spark but then gave it up because of
the complexity and the flakyness of it so I wouldn't care much
about unit testing.
The sad truth is that most of the token providers can be tested
manually on cluster.

Of course this doesn't mean that the whole code is not intended
to be covered with tests. I mean couple of parts can be
automatically tested but providers are not such.

> This also implies that any fields of the provider wouldn't
inherently have to be mutable.

I think this is not an issue. A provider connects to a service,
obtains token(s) and then close the connection and never seen the
need of an intermediate state.
I've just mentioned the singleton behavior to be clear.

> One examples is a TM restart + local recovery, where the TM
eagerly offers the previous set of slots to the leading JM.

Just to learn something new. I think local recovery is clear to
me which is not touching external systems like Kafka or so
(correct me if I'm wrong).
Is it possible that such case the user code

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

>  but it can happen that the JobMaster+TM collaborate to run stuff without
the TM being registered at the RM

Honestly I'm not educated enough within Flink to give an example to such
scenario.
Until now I thought JM defines tasks to be done and TM just blindly
connects to external systems and does the processing.
All in all if external systems can be touched when JM + TM collaboration
happens then we need to consider that in the design.
Since I don't have an example scenario I don't know what exactly needs to
be solved.
I think we need an example case to decide whether we face a real issue or
the design is not leaking.


On Thu, Feb 3, 2022 at 2:12 PM Chesnay Schepler  wrote:

> > Just to learn something new. I think local recovery is clear to me which
> is not touching external systems like Kafka or so (correct me if I'm
> wrong). Is it possible that such case the user code just starts to run
> blindly w/o JM coordination and connects to external systems to do data
> processing?
>
> Local recovery itself shouldn't touch external systems; the TM cannot just
> run user-code without the JobMaster being involved, but it can happen that
> the JobMaster+TM collaborate to run stuff without the TM being registered
> at the RM.
>
> On 03/02/2022 13:48, Gabor Somogyi wrote:
>
> > Any error in loading the provider (be it by accident or explicit checks)
> then is a setup error and we can fail the cluster.
>
> Fail fast is a good direction in my view. In Spark I wanted to go to this
> direction but there were other opinions so there if a provider is not
> loaded then the workload goes further.
> Of course the processing will fail if the token is missing...
>
> > Requiring HBase (and Hadoop for that matter) to be on the JM system
> classpath would be a bit unfortunate. Have you considered loading the
> providers as plugins?
>
> Even if it's unfortunate the actual implementation is depending on that
> already. Moving HBase and/or all token providers into plugins is a
> possibility.
> That way if one wants to use a specific provider then a plugin need to be
> added. If we would like to go to this direction I would do that in a
> separate
> FLIP not to have feature creep here. The actual FLIP already covers
> several thousand lines of code changes.
>
> > This is missing from the FLIP. From my experience with the metric
> reporters, having the implementation rely on the configuration is really
> annoying for testing purposes. That's why I suggested factories; they can
> take care of extracting all parameters that the implementation needs, and
> then pass them nicely via the constructor.
>
> ServiceLoader provided services must have a norarg constructor where no
> parameters can be passed.
> As a side note testing delegation token providers is pain in the ass and
> not possible with automated tests without creating a fully featured
> kerberos cluster with KDC, HDFS, HBase, Kafka, etc..
> We've had several tries in Spark but then gave it up because of the
> complexity and the flakyness of it so I wouldn't care much about unit
> testing.
> The sad truth is that most of the token providers can be tested manually
> on cluster.
>
> Of course this doesn't mean that the whole code is not intended to be
> covered with tests. I mean couple of parts can be automatically tested but
> providers are not such.
>
> > This also implies that any fields of the provider wouldn't inherently
> have to be mutable.
>
> I think this is not an issue. A provider connects to a service, obtains
> token(s) and then close the connection and never seen the need of an
> intermediate state.
> I've just mentioned the singleton behavior to be clear.
>
> > One examples is a TM restart + local recovery, where the TM eagerly
> offers the previous set of slots to the leading JM.
>
> Just to learn something new. I think local recovery is clear to me which
> is not touching external systems like Kafka or so (correct me if I'm wrong).
> Is it possible that such case the user code just starts to run blindly w/o
> JM coordination and connects to external systems to do data processing?
>
>
> On Thu, Feb 3, 2022 at 1:09 PM Chesnay Schepler 
> wrote:
>
>> 1)
>> The manager certainly shouldn't check for specific implementations.
>> The problem with classpath-based checks is it can easily happen that the
>> provider can't be loaded in the first place (e.g., if you don't use
>> reflection, which you currently kinda force), and in that case Flink can't
>> tell whether the token is not required or the cluster isn't set up
>> correctly.
>> As I see it we shouldn't try to be clever; if the users wants kerberos,
>> then have him enable the providers. Any error in loading the provider (be
>> it by accident or explicit checks) then is a setup error and we can fail
>> the cluster.
>> If we still want to auto-detect whether the provider should be used, note
>> that using factories would make this easier; the factory can check the
>> classpath (not having any direct dependencies

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

> Just to learn something new. I think local recovery is clear to me 
which is not touching external systems like Kafka or so (correct me if 
I'm wrong). Is it possible that such case the user code just starts to 
run blindly w/o JM coordination and connects to external systems to do 
data processing?

Local recovery itself shouldn't touch external systems; the TM cannot 
just run user-code without the JobMaster being involved, but it can 
happen that the JobMaster+TM collaborate to run stuff without the TM 
being registered at the RM.

On 03/02/2022 13:48, Gabor Somogyi wrote:
> Any error in loading the provider (be it by accident or explicit 
checks) then is a setup error and we can fail the cluster.

Fail fast is a good direction in my view. In Spark I wanted to go to 
this direction but there were other opinions so there if a provider is 
not loaded then the workload goes further.

Of course the processing will fail if the token is missing...

> Requiring HBase (and Hadoop for that matter) to be on the JM system 
classpath would be a bit unfortunate. Have you considered loading the 
providers as plugins?

Even if it's unfortunate the actual implementation is depending on 
that already. Moving HBase and/or all token providers into plugins is 
a possibility.
That way if one wants to use a specific provider then a plugin need to 
be added. If we would like to go to this direction I would do that in 
a separate
FLIP not to have feature creep here. The actual FLIP already covers 
several thousand lines of code changes.

> This is missing from the FLIP. From my experience with the metric 
reporters, having the implementation rely on the configuration is 
really annoying for testing purposes. That's why I suggested 
factories; they can take care of extracting all parameters that the 
implementation needs, and then pass them nicely via the constructor.

ServiceLoader provided services must have a norarg constructor where 
no parameters can be passed.
As a side note testing delegation token providers is pain in the ass 
and not possible with automated tests without creating a fully 
featured kerberos cluster with KDC, HDFS, HBase, Kafka, etc..
We've had several tries in Spark but then gave it up because of the 
complexity and the flakyness of it so I wouldn't care much about unit 
testing.
The sad truth is that most of the token providers can be tested 
manually on cluster.

Of course this doesn't mean that the whole code is not intended to be 
covered with tests. I mean couple of parts can be automatically tested 
but providers are not such.

> This also implies that any fields of the provider wouldn't 
inherently have to be mutable.

I think this is not an issue. A provider connects to a service, 
obtains token(s) and then close the connection and never seen the need 
of an intermediate state.

I've just mentioned the singleton behavior to be clear.

> One examples is a TM restart + local recovery, where the TM eagerly 
offers the previous set of slots to the leading JM.

Just to learn something new. I think local recovery is clear to me 
which is not touching external systems like Kafka or so (correct me if 
I'm wrong).
Is it possible that such case the user code just starts to run blindly 
w/o JM coordination and connects to external systems to do data 
processing?

On Thu, Feb 3, 2022 at 1:09 PM Chesnay Schepler  
wrote:

1)
The manager certainly shouldn't check for specific implementations.
The problem with classpath-based checks is it can easily happen
that the provider can't be loaded in the first place (e.g., if you
don't use reflection, which you currently kinda force), and in
that case Flink can't tell whether the token is not required or
the cluster isn't set up correctly.
As I see it we shouldn't try to be clever; if the users wants
kerberos, then have him enable the providers. Any error in loading
the provider (be it by accident or explicit checks) then is a
setup error and we can fail the cluster.
If we still want to auto-detect whether the provider should be
used, note that using factories would make this easier; the
factory can check the classpath (not having any direct
dependencies on HBase avoids the case above), and the provider no
longer needs reflection because it will only be used iff HBase is
on the CP.

Requiring HBase (and Hadoop for that matter) to be on the JM
system classpath would be a bit unfortunate. Have you considered
loading the providers as plugins?

2) > DelegationTokenProvider#init method

This is missing from the FLIP. From my experience with the metric
reporters, having the implementation rely on the configuration is
really annoying for testing purposes. That's why I suggested
factories; they can take care of extracting all parameters that
the implementation needs, and then pass them nicely via the
constructor. This also implies that any fields of the

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

> Any error in loading the provider (be it by accident or explicit checks)
then is a setup error and we can fail the cluster.

Fail fast is a good direction in my view. In Spark I wanted to go to this
direction but there were other opinions so there if a provider is not
loaded then the workload goes further.
Of course the processing will fail if the token is missing...

> Requiring HBase (and Hadoop for that matter) to be on the JM system
classpath would be a bit unfortunate. Have you considered loading the
providers as plugins?

Even if it's unfortunate the actual implementation is depending on that
already. Moving HBase and/or all token providers into plugins is a
possibility.
That way if one wants to use a specific provider then a plugin need to be
added. If we would like to go to this direction I would do that in a
separate
FLIP not to have feature creep here. The actual FLIP already covers several
thousand lines of code changes.

> This is missing from the FLIP. From my experience with the metric
reporters, having the implementation rely on the configuration is really
annoying for testing purposes. That's why I suggested factories; they can
take care of extracting all parameters that the implementation needs, and
then pass them nicely via the constructor.

ServiceLoader provided services must have a norarg constructor where no
parameters can be passed.
As a side note testing delegation token providers is pain in the ass and
not possible with automated tests without creating a fully featured
kerberos cluster with KDC, HDFS, HBase, Kafka, etc..
We've had several tries in Spark but then gave it up because of the
complexity and the flakyness of it so I wouldn't care much about unit
testing.
The sad truth is that most of the token providers can be tested manually on
cluster.

Of course this doesn't mean that the whole code is not intended to be
covered with tests. I mean couple of parts can be automatically tested but
providers are not such.

> This also implies that any fields of the provider wouldn't inherently
have to be mutable.

I think this is not an issue. A provider connects to a service, obtains
token(s) and then close the connection and never seen the need of an
intermediate state.
I've just mentioned the singleton behavior to be clear.

> One examples is a TM restart + local recovery, where the TM eagerly
offers the previous set of slots to the leading JM.

Just to learn something new. I think local recovery is clear to me which is
not touching external systems like Kafka or so (correct me if I'm wrong).
Is it possible that such case the user code just starts to run blindly w/o
JM coordination and connects to external systems to do data processing?

On Thu, Feb 3, 2022 at 1:09 PM Chesnay Schepler  wrote:

> 1)
> The manager certainly shouldn't check for specific implementations.
> The problem with classpath-based checks is it can easily happen that the
> provider can't be loaded in the first place (e.g., if you don't use
> reflection, which you currently kinda force), and in that case Flink can't
> tell whether the token is not required or the cluster isn't set up
> correctly.
> As I see it we shouldn't try to be clever; if the users wants kerberos,
> then have him enable the providers. Any error in loading the provider (be
> it by accident or explicit checks) then is a setup error and we can fail
> the cluster.
> If we still want to auto-detect whether the provider should be used, note
> that using factories would make this easier; the factory can check the
> classpath (not having any direct dependencies on HBase avoids the case
> above), and the provider no longer needs reflection because it will only be
> used iff HBase is on the CP.
>
> Requiring HBase (and Hadoop for that matter) to be on the JM system
> classpath would be a bit unfortunate. Have you considered loading the
> providers as plugins?
>
> 2) > DelegationTokenProvider#init method
>
> This is missing from the FLIP. From my experience with the metric
> reporters, having the implementation rely on the configuration is really
> annoying for testing purposes. That's why I suggested factories; they can
> take care of extracting all parameters that the implementation needs, and
> then pass them nicely via the constructor. This also implies that any
> fields of the provider wouldn't inherently have to be mutable.
>
> > workloads are not yet running until the initial token set is not
> propagated.
>
> This isn't necessarily true. It can happen that tasks are being deployed
> to the TM without it having registered with the RM; there is currently no
> requirement that a TM must be registered before it may offer slots / accept
> task submissions.
> One examples is a TM restart + local recovery, where the TM eagerly offers
> the previous set of slots to the leading JM.
>
> On 03/02/2022 12:39, Gabor Somogyi wrote:
>
> Thanks for the quick response!
> Appreciate your invested time...
>
> G
>
> On Thu, Feb 3, 2022 at 11:12 AM Chesnay Schepler

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

1)
The manager certainly shouldn't check for specific implementations.
The problem with classpath-based checks is it can easily happen that the
provider can't be loaded in the first place (e.g., if you don't use
reflection, which you currently kinda force), and in that case Flink
can't tell whether the token is not required or the cluster isn't set up
correctly.
As I see it we shouldn't try to be clever; if the users wants kerberos,
then have him enable the providers. Any error in loading the provider
(be it by accident or explicit checks) then is a setup error and we can
fail the cluster.
If we still want to auto-detect whether the provider should be used,
note that using factories would make this easier; the factory can check
the classpath (not having any direct dependencies on HBase avoids the
case above), and the provider no longer needs reflection because it will
only be used iff HBase is on the CP.

Requiring HBase (and Hadoop for that matter) to be on the JM system
classpath would be a bit unfortunate. Have you considered loading the
providers as plugins?

2) > DelegationTokenProvider#init method

This is missing from the FLIP. From my experience with the metric
reporters, having the implementation rely on the configuration is really
annoying for testing purposes. That's why I suggested factories; they
can take care of extracting all parameters that the implementation
needs, and then pass them nicely via the constructor. This also implies
that any fields of the provider wouldn't inherently have to be mutable.

> workloads are not yet running until the initial token set is not
propagated.

This isn't necessarily true. It can happen that tasks are being deployed
to the TM without it having registered with the RM; there is currently
no requirement that a TM must be registered before it may offer slots /
accept task submissions.
One examples is a TM restart + local recovery, where the TM eagerly
offers the previous set of slots to the leading JM.

On 03/02/2022 12:39, Gabor Somogyi wrote:

Thanks for the quick response!
Appreciate your invested time...

On Thu, Feb 3, 2022 at 11:12 AM Chesnay Schepler
wrote:

Thanks for answering the questions!

1) Does the HBase provider require HBase to be on the classpath?

To be instantiated no, to obtain a token yes.

If so, then could it even be loaded if Hbase is on the classpath?

The provider can be loaded but inside the provider it would detect
whether HBase is on classpath.
Just to be crystal clear here this is the actual implementation what I
would like to take over into the Provider.
Please see:
https://github.com/apache/flink/blob/e6210d40491ff28c779b8604e425f01983f8a3d7/flink-yarn/src/main/java/org/apache/flink/yarn/Utils.java#L243-L254

I've considered to load only the necessary Providers but that would
mean a generic Manager need to know that if the newly loaded Provider
is instanceof HBaseDelegationTokenProvider, then it need to be skipped.
I think it would add unnecessary complexity to the Manager and it
would contain ugly code parts(at least in my view ugly), like this
if (provider instanceof HBaseDelegationTokenProvider &&
hbaseIsNotOnClasspath()) {

// Skip intentionally
} else if (provider instanceof SomethingElseDelegationTokenProvider &&
somethingElseIsNotOnClasspath()) {

// Skip intentionally
} else {
providers.put(provider.serviceName(), provider);
}
I think the least code and most clear approach is to load the
providers and decide inside whether everything is given to obtain a token.

If not, then you're assuming the classpath of the JM/TM to be
the same, which isn't necessarily true (in general; and also if
Hbase is loaded from the user-jar).

I'm not assuming that the classpath of JM/TM must be the same. If the
HBase jar is coming from the user-jar then the HBase code is going to
use UGI within the JVM when authentication required.
Of course I've not yet tested within Flink but in Spark it is working
fine.
All in all JM/TM classpath may be different but on both side HBase jar
must exists somehow.

2) None of the /Providers/ in your PoC get access to the
configuration. Only the /Manager/ is. Note that I do not know
whether there is a need for the providers to have access to the
config, as that's very implementation specific I suppose.

You're right. Since this is just a POC and I don't have green light
I've not put too many effort for a proper
self-review. DelegationTokenProvider#init method must get Flink
configuration.
The reason behind is that several further configuration can be find
out using that. A good example is to get Hadoop conf.
The rationale behind is the same just like before, it would be good to
create a generic Manager as possible.
To be more specific some code must load Hadoop conf which could be the
Manager or the Provider.
If the manager does that then the generic Manager must be modified all
the time when

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

Thanks for the quick response!
Appreciate your invested time...

G

On Thu, Feb 3, 2022 at 11:12 AM Chesnay Schepler  wrote:

> Thanks for answering the questions!
>
> 1) Does the HBase provider require HBase to be on the classpath?
>

To be instantiated no, to obtain a token yes.

> If so, then could it even be loaded if Hbase is on the classpath?
>

The provider can be loaded but inside the provider it would detect whether
HBase is on classpath.
Just to be crystal clear here this is the actual implementation what I
would like to take over into the Provider.
Please see:
https://github.com/apache/flink/blob/e6210d40491ff28c779b8604e425f01983f8a3d7/flink-yarn/src/main/java/org/apache/flink/yarn/Utils.java#L243-L254

I've considered to load only the necessary Providers but that would mean a
generic Manager need to know that if the newly loaded Provider is
instanceof HBaseDelegationTokenProvider, then it need to be skipped.
I think it would add unnecessary complexity to the Manager and it would
contain ugly code parts(at least in my view ugly), like this
if (provider instanceof HBaseDelegationTokenProvider &&
hbaseIsNotOnClasspath()) {
  // Skip intentionally
} else if (provider instanceof SomethingElseDelegationTokenProvider &&
somethingElseIsNotOnClasspath()) {
  // Skip intentionally
} else {
  providers.put(provider.serviceName(), provider);
}
I think the least code and most clear approach is to load the providers and
decide inside whether everything is given to obtain a token.

If not, then you're assuming the classpath of the JM/TM to be the same,
> which isn't necessarily true (in general; and also if Hbase is loaded from
> the user-jar).
>

I'm not assuming that the classpath of JM/TM must be the same. If the HBase
jar is coming from the user-jar then the HBase code is going to use UGI
within the JVM when authentication required.
Of course I've not yet tested within Flink but in Spark it is working fine.
All in all JM/TM classpath may be different but on both side HBase jar must
exists somehow.

> 2) None of the *Providers* in your PoC get access to the configuration.
> Only the *Manager* is. Note that I do not know whether there is a need
> for the providers to have access to the config, as that's very
> implementation specific I suppose.
>

You're right. Since this is just a POC and I don't have green light I've
not put too many effort for a proper
self-review. DelegationTokenProvider#init method must get Flink
configuration.
The reason behind is that several further configuration can be find out
using that. A good example is to get Hadoop conf.
The rationale behind is the same just like before, it would be good to
create a generic Manager as possible.
To be more specific some code must load Hadoop conf which could be the
Manager or the Provider.
If the manager does that then the generic Manager must be modified all the
time when something special thing is needed for a new provider.
This could be super problematic when a custom provider is written.

> 10) I'm not sure myself. It could be something as trivial as creating some
> temporary directory in HDFS I suppose.
>

I've not found of such task.YARN and K8S are not expecting such things from
executors and workloads are not yet running until the initial token set is
not propagated.

>
> On 03/02/2022 10:23, Gabor Somogyi wrote:
>
> Please see my answers inline. Hope provided satisfying answers to all
> questions.
>
> G
>
> On Thu, Feb 3, 2022 at 9:17 AM Chesnay Schepler  
>  wrote:
>
>
> I have a few question that I'd appreciate if you could answer them.
>
>1. How does the Provider know whether it is required or not?
>
> All registered providers which are registered properly are going to be
>
> loaded and asked to obtain tokens. Worth to mention every provider
> has the right to decide whether it wants to obtain tokens or not (bool
> delegationTokensRequired()). For instance if provider detects that
> HBase is not on classpath or not configured properly then no tokens are
> obtained from that specific provider.
>
> You may ask how a provider is registered. Here it is:
> The provider is on classpath + there is a META-INF file which contains the
> name of the provider, for example:
> META-INF/services/org.apache.flink.runtime.security.token.DelegationTokenProvider
>  
> 
>
> 1. How does the configuration of Providers work (how do they get
>access to a configuration)?
>
> Flink configuration is going to be passed to all providers. Please see the
>
> POC 
> here:https://github.com/apache/flink/compare/master...gaborgsomogyi:dt?expand=1
> Service specific configurations are loaded on-the-fly. For example in HBase
> case it looks for HBase

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

Thanks for answering the questions!

1) Does the HBase provider require HBase to be on the classpath?
If so, then could it even be loaded if Hbase is on the classpath?
If not, then you're assuming the classpath of the JM/TM to be the
same, which isn't necessarily true (in general; and also if Hbase is
loaded from the user-jar).
2) None of the /Providers/ in your PoC get access to the configuration.
Only the /Manager/ is. Note that I do not know whether there is a need
for the providers to have access to the config, as that's very
implementation specific I suppose.
10) I'm not sure myself. It could be something as trivial as creating
some temporary directory in HDFS I suppose.

On 03/02/2022 10:23, Gabor Somogyi wrote:

Please see my answers inline. Hope provided satisfying answers to all
questions.

On Thu, Feb 3, 2022 at 9:17 AM Chesnay Schepler wrote:

I have a few question that I'd appreciate if you could answer them.

1. How does the Provider know whether it is required or not?

All registered providers which are registered properly are going to be

loaded and asked to obtain tokens. Worth to mention every provider
has the right to decide whether it wants to obtain tokens or not (bool
delegationTokensRequired()). For instance if provider detects that
HBase is not on classpath or not configured properly then no tokens are
obtained from that specific provider.

You may ask how a provider is registered. Here it is:
The provider is on classpath + there is a META-INF file which contains the
name of the provider, for example:
META-INF/services/org.apache.flink.runtime.security.token.DelegationTokenProvider

1. How does the configuration of Providers work (how do they get
access to a configuration)?

Flink configuration is going to be passed to all providers. Please see the

POC here:
https://github.com/apache/flink/compare/master...gaborgsomogyi:dt?expand=1
Service specific configurations are loaded on-the-fly. For example in HBase
case it looks for HBase configuration class which will be instantiated
within the provider.

1. How does a user select providers? (Is it purely based on the
provider being on the classpath?)

Providers can be explicitly turned off with the following config:

"security.kerberos.tokens.${name}.enabled". I've never seen that 2
different implementation would exist for a specific
external service, but if this edge case would exist then the mentioned
config need to be added, a new provider with a different name need to be
implemented and registered.
All in all we've seen that provider handling is not user specific task but
a cluster admin one. If a specific provider is needed then it's implemented
once per company, registered once
to the clusters and then all users may or may not use the obtained tokens.

Worth to mention the system will know which token need to be used when HDFS
is accessed, this part is automatic.

1. How can a user override an existing provider?

Pease see the previous bulletpoint.
1. What is DelegationTokenProvider#name() used for?

By default all providers which are registered properly (on classpath +

META-INF entry) are on by default. With
"security.kerberos.tokens.${name}.enabled" a specific provider can be
turned off.
Additionally I'm intended to use this in log entries later on for debugging
purposes. For example "hadoopfs provider obtained 2 tokens with ID...".
This would help what and when is happening
with tokens. The same applies to TaskManager side: "2 hadoopfs provider
tokens arrived with ID...". Important to note that the secret part will be
hidden in the mentioned log entries to keep the
attach surface low.

1. What happens if the names of 2 providers are identical?

Presume you mean 2 different classes which both registered and having the

same logic inside. This case both will be loaded and both is going to
obtain token(s) for the same service.
Both obtained token(s) are going to be added to the UGI. As a result the
second will overwrite the first but the order is not defined. Since both
token(s) are valid no matter which one is
used then access to the external system will work.

When the class names are same then service loader only loads a single entry
because services are singletons. That's the reason why state inside
providers are not advised.

1. Will we directly load the provider, or first load a factory
(usually preferable)?

Intended to load a provider directly by DTM. We can add an extra layer to

have factory but after consideration I came to a conclusion that it would
be and overkill this case.
Please have a look how it's planned to load providers now:
https://github.com/apache/flink/compare/master...gaborgsomogyi:dt?expand=1#diff-d56a0bc77335ff23c0318f6dec1872e7b19b1a9ef6d10fff8fbaab9aecac94faR54-R81

1. What is the

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

Please see my answers inline. Hope provided satisfying answers to all
questions.

G

On Thu, Feb 3, 2022 at 9:17 AM Chesnay Schepler  wrote:

> I have a few question that I'd appreciate if you could answer them.
>
>1. How does the Provider know whether it is required or not?
>
> All registered providers which are registered properly are going to be
loaded and asked to obtain tokens. Worth to mention every provider
has the right to decide whether it wants to obtain tokens or not (bool
delegationTokensRequired()). For instance if provider detects that
HBase is not on classpath or not configured properly then no tokens are
obtained from that specific provider.

You may ask how a provider is registered. Here it is:
The provider is on classpath + there is a META-INF file which contains the
name of the provider, for example:
META-INF/services/org.apache.flink.runtime.security.token.DelegationTokenProvider



>
>1. How does the configuration of Providers work (how do they get
>access to a configuration)?
>
> Flink configuration is going to be passed to all providers. Please see the
POC here:
https://github.com/apache/flink/compare/master...gaborgsomogyi:dt?expand=1
Service specific configurations are loaded on-the-fly. For example in HBase
case it looks for HBase configuration class which will be instantiated
within the provider.

>
>1. How does a user select providers? (Is it purely based on the
>provider being on the classpath?)
>
> Providers can be explicitly turned off with the following config:
"security.kerberos.tokens.${name}.enabled". I've never seen that 2
different implementation would exist for a specific
external service, but if this edge case would exist then the mentioned
config need to be added, a new provider with a different name need to be
implemented and registered.
All in all we've seen that provider handling is not user specific task but
a cluster admin one. If a specific provider is needed then it's implemented
once per company, registered once
to the clusters and then all users may or may not use the obtained tokens.

Worth to mention the system will know which token need to be used when HDFS
is accessed, this part is automatic.

>
>1. How can a user override an existing provider?
>
> Pease see the previous bulletpoint.

>
>1. What is DelegationTokenProvider#name() used for?
>
> By default all providers which are registered properly (on classpath +
META-INF entry) are on by default. With
"security.kerberos.tokens.${name}.enabled" a specific provider can be
turned off.
Additionally I'm intended to use this in log entries later on for debugging
purposes. For example "hadoopfs provider obtained 2 tokens with ID...".
This would help what and when is happening
with tokens. The same applies to TaskManager side: "2 hadoopfs provider
tokens arrived with ID...". Important to note that the secret part will be
hidden in the mentioned log entries to keep the
attach surface low.

>
>1. What happens if the names of 2 providers are identical?
>
> Presume you mean 2 different classes which both registered and having the
same logic inside. This case both will be loaded and both is going to
obtain token(s) for the same service.
Both obtained token(s) are going to be added to the UGI. As a result the
second will overwrite the first but the order is not defined. Since both
token(s) are valid no matter which one is
used then access to the external system will work.

When the class names are same then service loader only loads a single entry
because services are singletons. That's the reason why state inside
providers are not advised.

>
>1. Will we directly load the provider, or first load a factory
>(usually preferable)?
>
> Intended to load a provider directly by DTM. We can add an extra layer to
have factory but after consideration I came to a conclusion that it would
be and overkill this case.
Please have a look how it's planned to load providers now:
https://github.com/apache/flink/compare/master...gaborgsomogyi:dt?expand=1#diff-d56a0bc77335ff23c0318f6dec1872e7b19b1a9ef6d10fff8fbaab9aecac94faR54-R81


>
>1. What is the Credentials class (it would necessarily have to be a
>public api as well)?
>
> Credentials class is coming from Hadoop. My main intention was not to bind
the implementation to Hadoop completely. It is not possible because of the
following reasons:
* Several functionalities are must because there are no alternatives,
including but not limited to login from keytab, proper TGT cache handling,
passing tokens to Hadoop services like HDFS, HBase, Hive, etc.
* The partial win is that the whole delegation token framework is going to
be initiated if hadoop-common is on classpath (Hadoop is optional in core
libraries)
The possibility to eliminate Credentials from API could be:
* to convert Credentials to byte

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework


I have a few question that I'd appreciate if you could answer them.

1. How does the Provider know whether it is required or not?
2. How does the configuration of Providers work (how do they get access
   to a configuration)?
3. How does a user select providers? (Is it purely based on the
   provider being on the classpath?)
4. How can a user override an existing provider?
5. What is DelegationTokenProvider#name() used for?
6. What happens if the names of 2 providers are identical?
7. Will we directly load the provider, or first load a factory (usually
   preferable)?
8. What is the Credentials class (it would necessarily have to be a
   public api as well)?
9. What does the TaskManager do with the received token?
10. Is there any functionality in the TaskManager that could require a
   token on startup (i.e., before registering with the RM)?

On 11/01/2022 14:58, Gabor Somogyi wrote:


Hi All,

Hope all of you have enjoyed the holiday season.

I would like to start the discussion on FLIP-211

which
aims to provide a
Kerberos delegation token framework that /obtains/renews/distributes tokens
out-of-the-box.

Please be aware that the FLIP wiki area is not fully done since the
discussion may
change the feature in major ways. The proposal can be found in a google doc
here

.
As the community agrees on the approach the content will be moved to the
wiki page.

Feel free to add your thoughts to make this feature better!

BR,
G

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-31 Thread Gabor Somogyi

Thanks for the confirmation, now it works!

G


On Mon, Jan 31, 2022 at 12:25 PM Chesnay Schepler 
wrote:

> You should have permissions now. Note that I saw 2 accounts matching
> your name, and I picked gaborgsomogyi.
>
> On 31/01/2022 11:28, Gabor Somogyi wrote:
> > Not sure if the mentioned write right already given or not but I still
> > don't see any edit button.
> >
> > G
> >
> >
> > On Fri, Jan 28, 2022 at 5:08 PM Gabor Somogyi  >
> > wrote:
> >
> >> Hi Robert,
> >>
> >> That would be awesome.
> >>
> >> My cwiki username: gaborgsomogyi
> >>
> >> G
> >>
> >>
> >> On Fri, Jan 28, 2022 at 5:06 PM Robert Metzger 
> >> wrote:
> >>
> >>> Hey Gabor,
> >>>
> >>> let me know your cwiki username, and I can give you write permissions.
> >>>
> >>>
> >>> On Fri, Jan 28, 2022 at 4:05 PM Gabor Somogyi <
> gabor.g.somo...@gmail.com>
> >>> wrote:
> >>>
>  Thanks for making the design better! No further thing to discuss from
> my
>  side.
> 
>  Started to reflect the agreement in the FLIP doc.
>  Since I don't have access to the wiki I need to ask Marci to do that
> >>> which
>  may take some time.
> 
>  G
> 
> 
>  On Fri, Jan 28, 2022 at 3:52 PM David Morávek 
> wrote:
> 
> > Hi,
> >
> > AFAIU an under registration TM is not added to the registered TMs map
>  until
> >> RegistrationResponse ..
> >>
> > I think you're right, with a careful design around threading
> >>> (delegating
> > update broadcasts to the main thread) + synchronous initial update
> >>> (that
> > would be nice to avoid) this should be doable.
> >
> > Not sure what you mean "we can't register the TM without providing it
>  with
> >> token" but in unsecure configuration registration must happen w/o
>  tokens.
> > Exactly as you describe it, this was meant only for the "kerberized /
> > secured" cluster case, in other cases we wouldn't enforce a non-null
>  token
> > in the response
> >
> > I think this is a good idea in general.
> > +1
> >
> > If you don't have any more thoughts on the RPC / lifecycle part, can
> >>> you
> > please reflect it into the FLIP?
> >
> > D.
> >
> > On Fri, Jan 28, 2022 at 3:16 PM Gabor Somogyi <
> >>> gabor.g.somo...@gmail.com
> > wrote:
> >
> >>> - Make sure DTs issued by single DTMs are monotonically increasing
>  (can
> >> be
> >> sorted on TM side)
> >>
> >> AFAIU an under registration TM is not added to the registered TMs
> >>> map
> > until
> >> RegistrationResponse
> >> is processed which would contain the initial tokens. If that's true
>  then
> >> how is it possible to have race with
> >> DTM update which is working on the registered TMs list?
> >> To be more specific "taskExecutors" is the registered map of TMs to
>  which
> >> DTM can send updated tokens
> >> but this doesn't contain the under registration TM while
> >> RegistrationResponse is not processed, right?
> >>
> >> Of course if DTM can update while RegistrationResponse is processed
>  then
> >> somehow sorting would be
> >> required and that case I would agree.
> >>
> >> - Scope DT updates by the RM ID and ensure that TM only accepts
> >>> update
> > from
> >> the current leader
> >>
> >> I've planned this initially the mentioned way so agreed.
> >>
> >> - Return initial token with the RegistrationResponse, which should
> >>> make
> > the
> >> RPC contract bit clearer (ensure that we can't register the TM
> >>> without
> >> providing it with token)
> >>
> >> I think this is a good idea in general. Not sure what you mean "we
>  can't
> >> register the TM without
> >> providing it with token" but in unsecure configuration registration
>  must
> >> happen w/o tokens.
> >> All in all the newly added tokens field must be somehow optional.
> >>
> >> G
> >>
> >>
> >> On Fri, Jan 28, 2022 at 2:22 PM David Morávek 
> >>> wrote:
> >>> We had a long discussion with Chesnay about the possible edge
> >>> cases
>  and
> >> it
> >>> basically boils down to the following two scenarios:
> >>>
> >>> 1) There is a possible race condition between TM registration (the
> > first
> >> DT
> >>> update) and token refresh if they happen simultaneously. Than the
> >>> registration might beat the refreshed token. This could be easily
> >> addressed
> >>> if DTs could be sorted (eg. by the expiration time) on the TM
> >>> side.
>  In
> >>> other words, if there are multiple updates at the same time we
> >>> need
>  to
> >> make
> >>> sure that we have a deterministic way of choosing the latest one.
> >>>
> >>> One idea by Chesnay that popped up during this discussion was
> >>> whether
> > we
> >>> could simply return the initial token with the
> >>>

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-31 Thread Chesnay Schepler

You should have permissions now. Note that I saw 2 accounts matching 
your name, and I picked gaborgsomogyi.


On 31/01/2022 11:28, Gabor Somogyi wrote:

Not sure if the mentioned write right already given or not but I still
don't see any edit button.

G


On Fri, Jan 28, 2022 at 5:08 PM Gabor Somogyi 
wrote:


Hi Robert,

That would be awesome.

My cwiki username: gaborgsomogyi

G


On Fri, Jan 28, 2022 at 5:06 PM Robert Metzger 
wrote:


Hey Gabor,

let me know your cwiki username, and I can give you write permissions.


On Fri, Jan 28, 2022 at 4:05 PM Gabor Somogyi 
wrote:


Thanks for making the design better! No further thing to discuss from my
side.

Started to reflect the agreement in the FLIP doc.
Since I don't have access to the wiki I need to ask Marci to do that

which

may take some time.

G


On Fri, Jan 28, 2022 at 3:52 PM David Morávek  wrote:


Hi,

AFAIU an under registration TM is not added to the registered TMs map

until

RegistrationResponse ..


I think you're right, with a careful design around threading

(delegating

update broadcasts to the main thread) + synchronous initial update

(that

would be nice to avoid) this should be doable.

Not sure what you mean "we can't register the TM without providing it

with

token" but in unsecure configuration registration must happen w/o

tokens.

Exactly as you describe it, this was meant only for the "kerberized /
secured" cluster case, in other cases we wouldn't enforce a non-null

token

in the response

I think this is a good idea in general.
+1

If you don't have any more thoughts on the RPC / lifecycle part, can

you

please reflect it into the FLIP?

D.

On Fri, Jan 28, 2022 at 3:16 PM Gabor Somogyi <

gabor.g.somo...@gmail.com

wrote:


- Make sure DTs issued by single DTMs are monotonically increasing

(can

be
sorted on TM side)

AFAIU an under registration TM is not added to the registered TMs

map

until

RegistrationResponse
is processed which would contain the initial tokens. If that's true

then

how is it possible to have race with
DTM update which is working on the registered TMs list?
To be more specific "taskExecutors" is the registered map of TMs to

which

DTM can send updated tokens
but this doesn't contain the under registration TM while
RegistrationResponse is not processed, right?

Of course if DTM can update while RegistrationResponse is processed

then

somehow sorting would be
required and that case I would agree.

- Scope DT updates by the RM ID and ensure that TM only accepts

update

from

the current leader

I've planned this initially the mentioned way so agreed.

- Return initial token with the RegistrationResponse, which should

make

the

RPC contract bit clearer (ensure that we can't register the TM

without

providing it with token)

I think this is a good idea in general. Not sure what you mean "we

can't

register the TM without
providing it with token" but in unsecure configuration registration

must

happen w/o tokens.
All in all the newly added tokens field must be somehow optional.

G


On Fri, Jan 28, 2022 at 2:22 PM David Morávek 

wrote:

We had a long discussion with Chesnay about the possible edge

cases

and

it

basically boils down to the following two scenarios:

1) There is a possible race condition between TM registration (the

first

DT

update) and token refresh if they happen simultaneously. Than the
registration might beat the refreshed token. This could be easily

addressed

if DTs could be sorted (eg. by the expiration time) on the TM

side.

In

other words, if there are multiple updates at the same time we

need

to

make

sure that we have a deterministic way of choosing the latest one.

One idea by Chesnay that popped up during this discussion was

whether

we

could simply return the initial token with the

RegistrationResponse

to

avoid making an extra call during the TM registration.

2) When the RM leadership changes (eg. because zookeeper session

times

out)

there might be a race condition where the old RM is shutting down

and

updates the tokens, that it might again beat the registration

token

of

the

new RM. This could be avoided if we scope the token by

_ResourceManagerId_

and only accept updates for the current leader (basically we'd

have

an

extra parameter to the _updateDelegationToken_ method).

-

DTM is way simpler then for example slot management, which could

receive

updates from the JobMaster that RM might not know about.

So if you want to go in the path you're describing it should be

doable

and

we'd propose following to cover all cases:

- Make sure DTs issued by single DTMs are monotonically increasing

(can

be

sorted on TM side)
- Scope DT updates by the RM ID and ensure that TM only accepts

update

from

the current leader
- Return initial token with the RegistrationResponse, which should

make

the

RPC contract bit clearer (ensure that we can't register the TM

without

providing it with token)

Any thoughts?


On Fri, Jan 28, 2022 at

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-31 Thread Gabor Somogyi

Not sure if the mentioned write right already given or not but I still
don't see any edit button.

G


On Fri, Jan 28, 2022 at 5:08 PM Gabor Somogyi 
wrote:

> Hi Robert,
>
> That would be awesome.
>
> My cwiki username: gaborgsomogyi
>
> G
>
>
> On Fri, Jan 28, 2022 at 5:06 PM Robert Metzger 
> wrote:
>
>> Hey Gabor,
>>
>> let me know your cwiki username, and I can give you write permissions.
>>
>>
>> On Fri, Jan 28, 2022 at 4:05 PM Gabor Somogyi 
>> wrote:
>>
>> > Thanks for making the design better! No further thing to discuss from my
>> > side.
>> >
>> > Started to reflect the agreement in the FLIP doc.
>> > Since I don't have access to the wiki I need to ask Marci to do that
>> which
>> > may take some time.
>> >
>> > G
>> >
>> >
>> > On Fri, Jan 28, 2022 at 3:52 PM David Morávek  wrote:
>> >
>> > > Hi,
>> > >
>> > > AFAIU an under registration TM is not added to the registered TMs map
>> > until
>> > > > RegistrationResponse ..
>> > > >
>> > >
>> > > I think you're right, with a careful design around threading
>> (delegating
>> > > update broadcasts to the main thread) + synchronous initial update
>> (that
>> > > would be nice to avoid) this should be doable.
>> > >
>> > > Not sure what you mean "we can't register the TM without providing it
>> > with
>> > > > token" but in unsecure configuration registration must happen w/o
>> > tokens.
>> > > >
>> > >
>> > > Exactly as you describe it, this was meant only for the "kerberized /
>> > > secured" cluster case, in other cases we wouldn't enforce a non-null
>> > token
>> > > in the response
>> > >
>> > > I think this is a good idea in general.
>> > > >
>> > >
>> > > +1
>> > >
>> > > If you don't have any more thoughts on the RPC / lifecycle part, can
>> you
>> > > please reflect it into the FLIP?
>> > >
>> > > D.
>> > >
>> > > On Fri, Jan 28, 2022 at 3:16 PM Gabor Somogyi <
>> gabor.g.somo...@gmail.com
>> > >
>> > > wrote:
>> > >
>> > > > > - Make sure DTs issued by single DTMs are monotonically increasing
>> > (can
>> > > > be
>> > > > sorted on TM side)
>> > > >
>> > > > AFAIU an under registration TM is not added to the registered TMs
>> map
>> > > until
>> > > > RegistrationResponse
>> > > > is processed which would contain the initial tokens. If that's true
>> > then
>> > > > how is it possible to have race with
>> > > > DTM update which is working on the registered TMs list?
>> > > > To be more specific "taskExecutors" is the registered map of TMs to
>> > which
>> > > > DTM can send updated tokens
>> > > > but this doesn't contain the under registration TM while
>> > > > RegistrationResponse is not processed, right?
>> > > >
>> > > > Of course if DTM can update while RegistrationResponse is processed
>> > then
>> > > > somehow sorting would be
>> > > > required and that case I would agree.
>> > > >
>> > > > - Scope DT updates by the RM ID and ensure that TM only accepts
>> update
>> > > from
>> > > > the current leader
>> > > >
>> > > > I've planned this initially the mentioned way so agreed.
>> > > >
>> > > > - Return initial token with the RegistrationResponse, which should
>> make
>> > > the
>> > > > RPC contract bit clearer (ensure that we can't register the TM
>> without
>> > > > providing it with token)
>> > > >
>> > > > I think this is a good idea in general. Not sure what you mean "we
>> > can't
>> > > > register the TM without
>> > > > providing it with token" but in unsecure configuration registration
>> > must
>> > > > happen w/o tokens.
>> > > > All in all the newly added tokens field must be somehow optional.
>> > > >
>> > > > G
>> > > >
>> > > >
>> > > > On Fri, Jan 28, 2022 at 2:22 PM David Morávek 
>> wrote:
>> > > >
>> > > > > We had a long discussion with Chesnay about the possible edge
>> cases
>> > and
>> > > > it
>> > > > > basically boils down to the following two scenarios:
>> > > > >
>> > > > > 1) There is a possible race condition between TM registration (the
>> > > first
>> > > > DT
>> > > > > update) and token refresh if they happen simultaneously. Than the
>> > > > > registration might beat the refreshed token. This could be easily
>> > > > addressed
>> > > > > if DTs could be sorted (eg. by the expiration time) on the TM
>> side.
>> > In
>> > > > > other words, if there are multiple updates at the same time we
>> need
>> > to
>> > > > make
>> > > > > sure that we have a deterministic way of choosing the latest one.
>> > > > >
>> > > > > One idea by Chesnay that popped up during this discussion was
>> whether
>> > > we
>> > > > > could simply return the initial token with the
>> RegistrationResponse
>> > to
>> > > > > avoid making an extra call during the TM registration.
>> > > > >
>> > > > > 2) When the RM leadership changes (eg. because zookeeper session
>> > times
>> > > > out)
>> > > > > there might be a race condition where the old RM is shutting down
>> and
>> > > > > updates the tokens, that it might again beat the registration
>> token
>> > of
>> > > > the
>> > > > > new RM. This could be avoided if we scope

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

Hi Robert,

That would be awesome.

My cwiki username: gaborgsomogyi

G


On Fri, Jan 28, 2022 at 5:06 PM Robert Metzger  wrote:

> Hey Gabor,
>
> let me know your cwiki username, and I can give you write permissions.
>
>
> On Fri, Jan 28, 2022 at 4:05 PM Gabor Somogyi 
> wrote:
>
> > Thanks for making the design better! No further thing to discuss from my
> > side.
> >
> > Started to reflect the agreement in the FLIP doc.
> > Since I don't have access to the wiki I need to ask Marci to do that
> which
> > may take some time.
> >
> > G
> >
> >
> > On Fri, Jan 28, 2022 at 3:52 PM David Morávek  wrote:
> >
> > > Hi,
> > >
> > > AFAIU an under registration TM is not added to the registered TMs map
> > until
> > > > RegistrationResponse ..
> > > >
> > >
> > > I think you're right, with a careful design around threading
> (delegating
> > > update broadcasts to the main thread) + synchronous initial update
> (that
> > > would be nice to avoid) this should be doable.
> > >
> > > Not sure what you mean "we can't register the TM without providing it
> > with
> > > > token" but in unsecure configuration registration must happen w/o
> > tokens.
> > > >
> > >
> > > Exactly as you describe it, this was meant only for the "kerberized /
> > > secured" cluster case, in other cases we wouldn't enforce a non-null
> > token
> > > in the response
> > >
> > > I think this is a good idea in general.
> > > >
> > >
> > > +1
> > >
> > > If you don't have any more thoughts on the RPC / lifecycle part, can
> you
> > > please reflect it into the FLIP?
> > >
> > > D.
> > >
> > > On Fri, Jan 28, 2022 at 3:16 PM Gabor Somogyi <
> gabor.g.somo...@gmail.com
> > >
> > > wrote:
> > >
> > > > > - Make sure DTs issued by single DTMs are monotonically increasing
> > (can
> > > > be
> > > > sorted on TM side)
> > > >
> > > > AFAIU an under registration TM is not added to the registered TMs map
> > > until
> > > > RegistrationResponse
> > > > is processed which would contain the initial tokens. If that's true
> > then
> > > > how is it possible to have race with
> > > > DTM update which is working on the registered TMs list?
> > > > To be more specific "taskExecutors" is the registered map of TMs to
> > which
> > > > DTM can send updated tokens
> > > > but this doesn't contain the under registration TM while
> > > > RegistrationResponse is not processed, right?
> > > >
> > > > Of course if DTM can update while RegistrationResponse is processed
> > then
> > > > somehow sorting would be
> > > > required and that case I would agree.
> > > >
> > > > - Scope DT updates by the RM ID and ensure that TM only accepts
> update
> > > from
> > > > the current leader
> > > >
> > > > I've planned this initially the mentioned way so agreed.
> > > >
> > > > - Return initial token with the RegistrationResponse, which should
> make
> > > the
> > > > RPC contract bit clearer (ensure that we can't register the TM
> without
> > > > providing it with token)
> > > >
> > > > I think this is a good idea in general. Not sure what you mean "we
> > can't
> > > > register the TM without
> > > > providing it with token" but in unsecure configuration registration
> > must
> > > > happen w/o tokens.
> > > > All in all the newly added tokens field must be somehow optional.
> > > >
> > > > G
> > > >
> > > >
> > > > On Fri, Jan 28, 2022 at 2:22 PM David Morávek 
> wrote:
> > > >
> > > > > We had a long discussion with Chesnay about the possible edge cases
> > and
> > > > it
> > > > > basically boils down to the following two scenarios:
> > > > >
> > > > > 1) There is a possible race condition between TM registration (the
> > > first
> > > > DT
> > > > > update) and token refresh if they happen simultaneously. Than the
> > > > > registration might beat the refreshed token. This could be easily
> > > > addressed
> > > > > if DTs could be sorted (eg. by the expiration time) on the TM side.
> > In
> > > > > other words, if there are multiple updates at the same time we need
> > to
> > > > make
> > > > > sure that we have a deterministic way of choosing the latest one.
> > > > >
> > > > > One idea by Chesnay that popped up during this discussion was
> whether
> > > we
> > > > > could simply return the initial token with the RegistrationResponse
> > to
> > > > > avoid making an extra call during the TM registration.
> > > > >
> > > > > 2) When the RM leadership changes (eg. because zookeeper session
> > times
> > > > out)
> > > > > there might be a race condition where the old RM is shutting down
> and
> > > > > updates the tokens, that it might again beat the registration token
> > of
> > > > the
> > > > > new RM. This could be avoided if we scope the token by
> > > > _ResourceManagerId_
> > > > > and only accept updates for the current leader (basically we'd have
> > an
> > > > > extra parameter to the _updateDelegationToken_ method).
> > > > >
> > > > > -
> > > > >
> > > > > DTM is way simpler then for example slot management, which could
> > > receive
> > > > > updates from the

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

We've made the changes both in the doc + wiki.
Please have a look and notify me if I've missed something based on our
agreement.

G


On Fri, Jan 28, 2022 at 4:04 PM Gabor Somogyi 
wrote:

> Thanks for making the design better! No further thing to discuss from my
> side.
>
> Started to reflect the agreement in the FLIP doc.
> Since I don't have access to the wiki I need to ask Marci to do that which
> may take some time.
>
> G
>
>
> On Fri, Jan 28, 2022 at 3:52 PM David Morávek  wrote:
>
>> Hi,
>>
>> AFAIU an under registration TM is not added to the registered TMs map
>> until
>> > RegistrationResponse ..
>> >
>>
>> I think you're right, with a careful design around threading (delegating
>> update broadcasts to the main thread) + synchronous initial update (that
>> would be nice to avoid) this should be doable.
>>
>> Not sure what you mean "we can't register the TM without providing it with
>> > token" but in unsecure configuration registration must happen w/o
>> tokens.
>> >
>>
>> Exactly as you describe it, this was meant only for the "kerberized /
>> secured" cluster case, in other cases we wouldn't enforce a non-null token
>> in the response
>>
>> I think this is a good idea in general.
>> >
>>
>> +1
>>
>> If you don't have any more thoughts on the RPC / lifecycle part, can you
>> please reflect it into the FLIP?
>>
>> D.
>>
>> On Fri, Jan 28, 2022 at 3:16 PM Gabor Somogyi 
>> wrote:
>>
>> > > - Make sure DTs issued by single DTMs are monotonically increasing
>> (can
>> > be
>> > sorted on TM side)
>> >
>> > AFAIU an under registration TM is not added to the registered TMs map
>> until
>> > RegistrationResponse
>> > is processed which would contain the initial tokens. If that's true then
>> > how is it possible to have race with
>> > DTM update which is working on the registered TMs list?
>> > To be more specific "taskExecutors" is the registered map of TMs to
>> which
>> > DTM can send updated tokens
>> > but this doesn't contain the under registration TM while
>> > RegistrationResponse is not processed, right?
>> >
>> > Of course if DTM can update while RegistrationResponse is processed then
>> > somehow sorting would be
>> > required and that case I would agree.
>> >
>> > - Scope DT updates by the RM ID and ensure that TM only accepts update
>> from
>> > the current leader
>> >
>> > I've planned this initially the mentioned way so agreed.
>> >
>> > - Return initial token with the RegistrationResponse, which should make
>> the
>> > RPC contract bit clearer (ensure that we can't register the TM without
>> > providing it with token)
>> >
>> > I think this is a good idea in general. Not sure what you mean "we can't
>> > register the TM without
>> > providing it with token" but in unsecure configuration registration must
>> > happen w/o tokens.
>> > All in all the newly added tokens field must be somehow optional.
>> >
>> > G
>> >
>> >
>> > On Fri, Jan 28, 2022 at 2:22 PM David Morávek  wrote:
>> >
>> > > We had a long discussion with Chesnay about the possible edge cases
>> and
>> > it
>> > > basically boils down to the following two scenarios:
>> > >
>> > > 1) There is a possible race condition between TM registration (the
>> first
>> > DT
>> > > update) and token refresh if they happen simultaneously. Than the
>> > > registration might beat the refreshed token. This could be easily
>> > addressed
>> > > if DTs could be sorted (eg. by the expiration time) on the TM side. In
>> > > other words, if there are multiple updates at the same time we need to
>> > make
>> > > sure that we have a deterministic way of choosing the latest one.
>> > >
>> > > One idea by Chesnay that popped up during this discussion was whether
>> we
>> > > could simply return the initial token with the RegistrationResponse to
>> > > avoid making an extra call during the TM registration.
>> > >
>> > > 2) When the RM leadership changes (eg. because zookeeper session times
>> > out)
>> > > there might be a race condition where the old RM is shutting down and
>> > > updates the tokens, that it might again beat the registration token of
>> > the
>> > > new RM. This could be avoided if we scope the token by
>> > _ResourceManagerId_
>> > > and only accept updates for the current leader (basically we'd have an
>> > > extra parameter to the _updateDelegationToken_ method).
>> > >
>> > > -
>> > >
>> > > DTM is way simpler then for example slot management, which could
>> receive
>> > > updates from the JobMaster that RM might not know about.
>> > >
>> > > So if you want to go in the path you're describing it should be doable
>> > and
>> > > we'd propose following to cover all cases:
>> > >
>> > > - Make sure DTs issued by single DTMs are monotonically increasing
>> (can
>> > be
>> > > sorted on TM side)
>> > > - Scope DT updates by the RM ID and ensure that TM only accepts update
>> > from
>> > > the current leader
>> > > - Return initial token with the RegistrationResponse, which should
>> make
>> > the
>> > > RPC contract bit clearer

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-28 Thread Robert Metzger

Hey Gabor,

let me know your cwiki username, and I can give you write permissions.


On Fri, Jan 28, 2022 at 4:05 PM Gabor Somogyi 
wrote:

> Thanks for making the design better! No further thing to discuss from my
> side.
>
> Started to reflect the agreement in the FLIP doc.
> Since I don't have access to the wiki I need to ask Marci to do that which
> may take some time.
>
> G
>
>
> On Fri, Jan 28, 2022 at 3:52 PM David Morávek  wrote:
>
> > Hi,
> >
> > AFAIU an under registration TM is not added to the registered TMs map
> until
> > > RegistrationResponse ..
> > >
> >
> > I think you're right, with a careful design around threading (delegating
> > update broadcasts to the main thread) + synchronous initial update (that
> > would be nice to avoid) this should be doable.
> >
> > Not sure what you mean "we can't register the TM without providing it
> with
> > > token" but in unsecure configuration registration must happen w/o
> tokens.
> > >
> >
> > Exactly as you describe it, this was meant only for the "kerberized /
> > secured" cluster case, in other cases we wouldn't enforce a non-null
> token
> > in the response
> >
> > I think this is a good idea in general.
> > >
> >
> > +1
> >
> > If you don't have any more thoughts on the RPC / lifecycle part, can you
> > please reflect it into the FLIP?
> >
> > D.
> >
> > On Fri, Jan 28, 2022 at 3:16 PM Gabor Somogyi  >
> > wrote:
> >
> > > > - Make sure DTs issued by single DTMs are monotonically increasing
> (can
> > > be
> > > sorted on TM side)
> > >
> > > AFAIU an under registration TM is not added to the registered TMs map
> > until
> > > RegistrationResponse
> > > is processed which would contain the initial tokens. If that's true
> then
> > > how is it possible to have race with
> > > DTM update which is working on the registered TMs list?
> > > To be more specific "taskExecutors" is the registered map of TMs to
> which
> > > DTM can send updated tokens
> > > but this doesn't contain the under registration TM while
> > > RegistrationResponse is not processed, right?
> > >
> > > Of course if DTM can update while RegistrationResponse is processed
> then
> > > somehow sorting would be
> > > required and that case I would agree.
> > >
> > > - Scope DT updates by the RM ID and ensure that TM only accepts update
> > from
> > > the current leader
> > >
> > > I've planned this initially the mentioned way so agreed.
> > >
> > > - Return initial token with the RegistrationResponse, which should make
> > the
> > > RPC contract bit clearer (ensure that we can't register the TM without
> > > providing it with token)
> > >
> > > I think this is a good idea in general. Not sure what you mean "we
> can't
> > > register the TM without
> > > providing it with token" but in unsecure configuration registration
> must
> > > happen w/o tokens.
> > > All in all the newly added tokens field must be somehow optional.
> > >
> > > G
> > >
> > >
> > > On Fri, Jan 28, 2022 at 2:22 PM David Morávek  wrote:
> > >
> > > > We had a long discussion with Chesnay about the possible edge cases
> and
> > > it
> > > > basically boils down to the following two scenarios:
> > > >
> > > > 1) There is a possible race condition between TM registration (the
> > first
> > > DT
> > > > update) and token refresh if they happen simultaneously. Than the
> > > > registration might beat the refreshed token. This could be easily
> > > addressed
> > > > if DTs could be sorted (eg. by the expiration time) on the TM side.
> In
> > > > other words, if there are multiple updates at the same time we need
> to
> > > make
> > > > sure that we have a deterministic way of choosing the latest one.
> > > >
> > > > One idea by Chesnay that popped up during this discussion was whether
> > we
> > > > could simply return the initial token with the RegistrationResponse
> to
> > > > avoid making an extra call during the TM registration.
> > > >
> > > > 2) When the RM leadership changes (eg. because zookeeper session
> times
> > > out)
> > > > there might be a race condition where the old RM is shutting down and
> > > > updates the tokens, that it might again beat the registration token
> of
> > > the
> > > > new RM. This could be avoided if we scope the token by
> > > _ResourceManagerId_
> > > > and only accept updates for the current leader (basically we'd have
> an
> > > > extra parameter to the _updateDelegationToken_ method).
> > > >
> > > > -
> > > >
> > > > DTM is way simpler then for example slot management, which could
> > receive
> > > > updates from the JobMaster that RM might not know about.
> > > >
> > > > So if you want to go in the path you're describing it should be
> doable
> > > and
> > > > we'd propose following to cover all cases:
> > > >
> > > > - Make sure DTs issued by single DTMs are monotonically increasing
> (can
> > > be
> > > > sorted on TM side)
> > > > - Scope DT updates by the RM ID and ensure that TM only accepts
> update
> > > from
> > > > the current leader
> > > > - Return initial token

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

Thanks for making the design better! No further thing to discuss from my
side.

Started to reflect the agreement in the FLIP doc.
Since I don't have access to the wiki I need to ask Marci to do that which
may take some time.

G


On Fri, Jan 28, 2022 at 3:52 PM David Morávek  wrote:

> Hi,
>
> AFAIU an under registration TM is not added to the registered TMs map until
> > RegistrationResponse ..
> >
>
> I think you're right, with a careful design around threading (delegating
> update broadcasts to the main thread) + synchronous initial update (that
> would be nice to avoid) this should be doable.
>
> Not sure what you mean "we can't register the TM without providing it with
> > token" but in unsecure configuration registration must happen w/o tokens.
> >
>
> Exactly as you describe it, this was meant only for the "kerberized /
> secured" cluster case, in other cases we wouldn't enforce a non-null token
> in the response
>
> I think this is a good idea in general.
> >
>
> +1
>
> If you don't have any more thoughts on the RPC / lifecycle part, can you
> please reflect it into the FLIP?
>
> D.
>
> On Fri, Jan 28, 2022 at 3:16 PM Gabor Somogyi 
> wrote:
>
> > > - Make sure DTs issued by single DTMs are monotonically increasing (can
> > be
> > sorted on TM side)
> >
> > AFAIU an under registration TM is not added to the registered TMs map
> until
> > RegistrationResponse
> > is processed which would contain the initial tokens. If that's true then
> > how is it possible to have race with
> > DTM update which is working on the registered TMs list?
> > To be more specific "taskExecutors" is the registered map of TMs to which
> > DTM can send updated tokens
> > but this doesn't contain the under registration TM while
> > RegistrationResponse is not processed, right?
> >
> > Of course if DTM can update while RegistrationResponse is processed then
> > somehow sorting would be
> > required and that case I would agree.
> >
> > - Scope DT updates by the RM ID and ensure that TM only accepts update
> from
> > the current leader
> >
> > I've planned this initially the mentioned way so agreed.
> >
> > - Return initial token with the RegistrationResponse, which should make
> the
> > RPC contract bit clearer (ensure that we can't register the TM without
> > providing it with token)
> >
> > I think this is a good idea in general. Not sure what you mean "we can't
> > register the TM without
> > providing it with token" but in unsecure configuration registration must
> > happen w/o tokens.
> > All in all the newly added tokens field must be somehow optional.
> >
> > G
> >
> >
> > On Fri, Jan 28, 2022 at 2:22 PM David Morávek  wrote:
> >
> > > We had a long discussion with Chesnay about the possible edge cases and
> > it
> > > basically boils down to the following two scenarios:
> > >
> > > 1) There is a possible race condition between TM registration (the
> first
> > DT
> > > update) and token refresh if they happen simultaneously. Than the
> > > registration might beat the refreshed token. This could be easily
> > addressed
> > > if DTs could be sorted (eg. by the expiration time) on the TM side. In
> > > other words, if there are multiple updates at the same time we need to
> > make
> > > sure that we have a deterministic way of choosing the latest one.
> > >
> > > One idea by Chesnay that popped up during this discussion was whether
> we
> > > could simply return the initial token with the RegistrationResponse to
> > > avoid making an extra call during the TM registration.
> > >
> > > 2) When the RM leadership changes (eg. because zookeeper session times
> > out)
> > > there might be a race condition where the old RM is shutting down and
> > > updates the tokens, that it might again beat the registration token of
> > the
> > > new RM. This could be avoided if we scope the token by
> > _ResourceManagerId_
> > > and only accept updates for the current leader (basically we'd have an
> > > extra parameter to the _updateDelegationToken_ method).
> > >
> > > -
> > >
> > > DTM is way simpler then for example slot management, which could
> receive
> > > updates from the JobMaster that RM might not know about.
> > >
> > > So if you want to go in the path you're describing it should be doable
> > and
> > > we'd propose following to cover all cases:
> > >
> > > - Make sure DTs issued by single DTMs are monotonically increasing (can
> > be
> > > sorted on TM side)
> > > - Scope DT updates by the RM ID and ensure that TM only accepts update
> > from
> > > the current leader
> > > - Return initial token with the RegistrationResponse, which should make
> > the
> > > RPC contract bit clearer (ensure that we can't register the TM without
> > > providing it with token)
> > >
> > > Any thoughts?
> > >
> > >
> > > On Fri, Jan 28, 2022 at 10:53 AM Gabor Somogyi <
> > gabor.g.somo...@gmail.com>
> > > wrote:
> > >
> > > > Thanks for investing your time!
> > > >
> > > > The first 2 bulletpoint are clear.
> > > > If there is a chance that a TM can

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-28 Thread David Morávek

Hi,

AFAIU an under registration TM is not added to the registered TMs map until
> RegistrationResponse ..
>

I think you're right, with a careful design around threading (delegating
update broadcasts to the main thread) + synchronous initial update (that
would be nice to avoid) this should be doable.

Not sure what you mean "we can't register the TM without providing it with
> token" but in unsecure configuration registration must happen w/o tokens.
>

Exactly as you describe it, this was meant only for the "kerberized /
secured" cluster case, in other cases we wouldn't enforce a non-null token
in the response

I think this is a good idea in general.
>

+1

If you don't have any more thoughts on the RPC / lifecycle part, can you
please reflect it into the FLIP?

D.

On Fri, Jan 28, 2022 at 3:16 PM Gabor Somogyi 
wrote:

> > - Make sure DTs issued by single DTMs are monotonically increasing (can
> be
> sorted on TM side)
>
> AFAIU an under registration TM is not added to the registered TMs map until
> RegistrationResponse
> is processed which would contain the initial tokens. If that's true then
> how is it possible to have race with
> DTM update which is working on the registered TMs list?
> To be more specific "taskExecutors" is the registered map of TMs to which
> DTM can send updated tokens
> but this doesn't contain the under registration TM while
> RegistrationResponse is not processed, right?
>
> Of course if DTM can update while RegistrationResponse is processed then
> somehow sorting would be
> required and that case I would agree.
>
> - Scope DT updates by the RM ID and ensure that TM only accepts update from
> the current leader
>
> I've planned this initially the mentioned way so agreed.
>
> - Return initial token with the RegistrationResponse, which should make the
> RPC contract bit clearer (ensure that we can't register the TM without
> providing it with token)
>
> I think this is a good idea in general. Not sure what you mean "we can't
> register the TM without
> providing it with token" but in unsecure configuration registration must
> happen w/o tokens.
> All in all the newly added tokens field must be somehow optional.
>
> G
>
>
> On Fri, Jan 28, 2022 at 2:22 PM David Morávek  wrote:
>
> > We had a long discussion with Chesnay about the possible edge cases and
> it
> > basically boils down to the following two scenarios:
> >
> > 1) There is a possible race condition between TM registration (the first
> DT
> > update) and token refresh if they happen simultaneously. Than the
> > registration might beat the refreshed token. This could be easily
> addressed
> > if DTs could be sorted (eg. by the expiration time) on the TM side. In
> > other words, if there are multiple updates at the same time we need to
> make
> > sure that we have a deterministic way of choosing the latest one.
> >
> > One idea by Chesnay that popped up during this discussion was whether we
> > could simply return the initial token with the RegistrationResponse to
> > avoid making an extra call during the TM registration.
> >
> > 2) When the RM leadership changes (eg. because zookeeper session times
> out)
> > there might be a race condition where the old RM is shutting down and
> > updates the tokens, that it might again beat the registration token of
> the
> > new RM. This could be avoided if we scope the token by
> _ResourceManagerId_
> > and only accept updates for the current leader (basically we'd have an
> > extra parameter to the _updateDelegationToken_ method).
> >
> > -
> >
> > DTM is way simpler then for example slot management, which could receive
> > updates from the JobMaster that RM might not know about.
> >
> > So if you want to go in the path you're describing it should be doable
> and
> > we'd propose following to cover all cases:
> >
> > - Make sure DTs issued by single DTMs are monotonically increasing (can
> be
> > sorted on TM side)
> > - Scope DT updates by the RM ID and ensure that TM only accepts update
> from
> > the current leader
> > - Return initial token with the RegistrationResponse, which should make
> the
> > RPC contract bit clearer (ensure that we can't register the TM without
> > providing it with token)
> >
> > Any thoughts?
> >
> >
> > On Fri, Jan 28, 2022 at 10:53 AM Gabor Somogyi <
> gabor.g.somo...@gmail.com>
> > wrote:
> >
> > > Thanks for investing your time!
> > >
> > > The first 2 bulletpoint are clear.
> > > If there is a chance that a TM can go to an inconsistent state then I
> > agree
> > > with the 3rd bulletpoint.
> > > Just before we agree on that I would like to learn something new and
> > > understand how is it possible that a TM
> > > gets corrupted? (In Spark I've never seen such thing and no mechanism
> to
> > > fix this but Flink is definitely not Spark)
> > >
> > > Here is my understanding:
> > > * DTM pushes new obtained DTs to TMs and if any exception occurs then a
> > > retry after "security.kerberos.tokens.retry-wait"
> > > happens. This means DTM retries

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

> - Make sure DTs issued by single DTMs are monotonically increasing (can be
sorted on TM side)

AFAIU an under registration TM is not added to the registered TMs map until
RegistrationResponse
is processed which would contain the initial tokens. If that's true then
how is it possible to have race with
DTM update which is working on the registered TMs list?
To be more specific "taskExecutors" is the registered map of TMs to which
DTM can send updated tokens
but this doesn't contain the under registration TM while
RegistrationResponse is not processed, right?

Of course if DTM can update while RegistrationResponse is processed then
somehow sorting would be
required and that case I would agree.

- Scope DT updates by the RM ID and ensure that TM only accepts update from
the current leader

I've planned this initially the mentioned way so agreed.

- Return initial token with the RegistrationResponse, which should make the
RPC contract bit clearer (ensure that we can't register the TM without
providing it with token)

I think this is a good idea in general. Not sure what you mean "we can't
register the TM without
providing it with token" but in unsecure configuration registration must
happen w/o tokens.
All in all the newly added tokens field must be somehow optional.

G


On Fri, Jan 28, 2022 at 2:22 PM David Morávek  wrote:

> We had a long discussion with Chesnay about the possible edge cases and it
> basically boils down to the following two scenarios:
>
> 1) There is a possible race condition between TM registration (the first DT
> update) and token refresh if they happen simultaneously. Than the
> registration might beat the refreshed token. This could be easily addressed
> if DTs could be sorted (eg. by the expiration time) on the TM side. In
> other words, if there are multiple updates at the same time we need to make
> sure that we have a deterministic way of choosing the latest one.
>
> One idea by Chesnay that popped up during this discussion was whether we
> could simply return the initial token with the RegistrationResponse to
> avoid making an extra call during the TM registration.
>
> 2) When the RM leadership changes (eg. because zookeeper session times out)
> there might be a race condition where the old RM is shutting down and
> updates the tokens, that it might again beat the registration token of the
> new RM. This could be avoided if we scope the token by _ResourceManagerId_
> and only accept updates for the current leader (basically we'd have an
> extra parameter to the _updateDelegationToken_ method).
>
> -
>
> DTM is way simpler then for example slot management, which could receive
> updates from the JobMaster that RM might not know about.
>
> So if you want to go in the path you're describing it should be doable and
> we'd propose following to cover all cases:
>
> - Make sure DTs issued by single DTMs are monotonically increasing (can be
> sorted on TM side)
> - Scope DT updates by the RM ID and ensure that TM only accepts update from
> the current leader
> - Return initial token with the RegistrationResponse, which should make the
> RPC contract bit clearer (ensure that we can't register the TM without
> providing it with token)
>
> Any thoughts?
>
>
> On Fri, Jan 28, 2022 at 10:53 AM Gabor Somogyi 
> wrote:
>
> > Thanks for investing your time!
> >
> > The first 2 bulletpoint are clear.
> > If there is a chance that a TM can go to an inconsistent state then I
> agree
> > with the 3rd bulletpoint.
> > Just before we agree on that I would like to learn something new and
> > understand how is it possible that a TM
> > gets corrupted? (In Spark I've never seen such thing and no mechanism to
> > fix this but Flink is definitely not Spark)
> >
> > Here is my understanding:
> > * DTM pushes new obtained DTs to TMs and if any exception occurs then a
> > retry after "security.kerberos.tokens.retry-wait"
> > happens. This means DTM retries until it's not possible to send new DTs
> to
> > all registered TMs.
> > * New TM registration must fail if "updateDelegationToken" fails
> > * "updateDelegationToken" fails consistently like a DB (at least I plan
> to
> > implement it that way).
> > If DTs are arriving on the TM side then a single
> > "UserGroupInformation.getCurrentUser.addCredentials"
> > will be called which I've never seen it failed.
> > * I hope all other code parts are not touching existing DTs within the
> JVM
> >
> > I would like to emphasize I'm not against to add it just want to see what
> > kind of problems are we facing.
> > It would ease to catch bugs earlier and help in the maintenance.
> >
> > All in all I would buy the idea to add the 3rd bullet if we foresee the
> > need.
> >
> > G
> >
> >
> > On Fri, Jan 28, 2022 at 10:07 AM David Morávek  wrote:
> >
> > > Hi Gabor,
> > >
> > > This is definitely headed in a right direction +1.
> > >
> > > I think we still need to have a safeguard in case some of the TMs gets
> > into
> > > the inconsistent state though, which will also

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-28 Thread David Morávek

We had a long discussion with Chesnay about the possible edge cases and it
basically boils down to the following two scenarios:

1) There is a possible race condition between TM registration (the first DT
update) and token refresh if they happen simultaneously. Than the
registration might beat the refreshed token. This could be easily addressed
if DTs could be sorted (eg. by the expiration time) on the TM side. In
other words, if there are multiple updates at the same time we need to make
sure that we have a deterministic way of choosing the latest one.

One idea by Chesnay that popped up during this discussion was whether we
could simply return the initial token with the RegistrationResponse to
avoid making an extra call during the TM registration.

2) When the RM leadership changes (eg. because zookeeper session times out)
there might be a race condition where the old RM is shutting down and
updates the tokens, that it might again beat the registration token of the
new RM. This could be avoided if we scope the token by _ResourceManagerId_
and only accept updates for the current leader (basically we'd have an
extra parameter to the _updateDelegationToken_ method).

-

DTM is way simpler then for example slot management, which could receive
updates from the JobMaster that RM might not know about.

So if you want to go in the path you're describing it should be doable and
we'd propose following to cover all cases:

- Make sure DTs issued by single DTMs are monotonically increasing (can be
sorted on TM side)
- Scope DT updates by the RM ID and ensure that TM only accepts update from
the current leader
- Return initial token with the RegistrationResponse, which should make the
RPC contract bit clearer (ensure that we can't register the TM without
providing it with token)

Any thoughts?


On Fri, Jan 28, 2022 at 10:53 AM Gabor Somogyi 
wrote:

> Thanks for investing your time!
>
> The first 2 bulletpoint are clear.
> If there is a chance that a TM can go to an inconsistent state then I agree
> with the 3rd bulletpoint.
> Just before we agree on that I would like to learn something new and
> understand how is it possible that a TM
> gets corrupted? (In Spark I've never seen such thing and no mechanism to
> fix this but Flink is definitely not Spark)
>
> Here is my understanding:
> * DTM pushes new obtained DTs to TMs and if any exception occurs then a
> retry after "security.kerberos.tokens.retry-wait"
> happens. This means DTM retries until it's not possible to send new DTs to
> all registered TMs.
> * New TM registration must fail if "updateDelegationToken" fails
> * "updateDelegationToken" fails consistently like a DB (at least I plan to
> implement it that way).
> If DTs are arriving on the TM side then a single
> "UserGroupInformation.getCurrentUser.addCredentials"
> will be called which I've never seen it failed.
> * I hope all other code parts are not touching existing DTs within the JVM
>
> I would like to emphasize I'm not against to add it just want to see what
> kind of problems are we facing.
> It would ease to catch bugs earlier and help in the maintenance.
>
> All in all I would buy the idea to add the 3rd bullet if we foresee the
> need.
>
> G
>
>
> On Fri, Jan 28, 2022 at 10:07 AM David Morávek  wrote:
>
> > Hi Gabor,
> >
> > This is definitely headed in a right direction +1.
> >
> > I think we still need to have a safeguard in case some of the TMs gets
> into
> > the inconsistent state though, which will also eliminate the need for
> > implementing a custom retry mechanism (when _updateDelegationToken_ call
> > fails for some reason).
> >
> > We already have this safeguard in place for slot pool (in case there are
> > some slots in inconsistent state - eg. we haven't freed them for some
> > reason) and for the partition tracker, which could be simply enhanced.
> This
> > is done via periodic heartbeat from TaskManagers to the ResourceManager
> > that contains report about state of these two components (from TM
> > perspective) so the RM can reconcile their state if necessary.
> >
> > I don't think adding an additional field to
> _TaskExecutorHeartbeatPayload_
> > should be a concern as we only heartbeat every ~ 10s by default and the
> new
> > field would be small compared to rest of the existing payload. Also
> > heartbeat doesn't need to contain the whole DT, but just some identifier
> > which signals whether it uses the right one, that could be significantly
> > smaller.
> >
> > This is still a PUSH based approach as the RM would again call the newly
> > introduced _updateDelegationToken_ when it encounters inconsistency (eg.
> > due to a temporary network partition / a race condition we didn't test
> for
> > / some other scenario we didn't think about). In practice these
> > inconsistencies are super hard to avoid and reason about (and
> unfortunately
> > yes, we see them happen from time to time), so reusing the existing
> > mechanism that is designed for this exact problem simplify

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

Thanks for investing your time!

The first 2 bulletpoint are clear.
If there is a chance that a TM can go to an inconsistent state then I agree
with the 3rd bulletpoint.
Just before we agree on that I would like to learn something new and
understand how is it possible that a TM
gets corrupted? (In Spark I've never seen such thing and no mechanism to
fix this but Flink is definitely not Spark)

Here is my understanding:
* DTM pushes new obtained DTs to TMs and if any exception occurs then a
retry after "security.kerberos.tokens.retry-wait"
happens. This means DTM retries until it's not possible to send new DTs to
all registered TMs.
* New TM registration must fail if "updateDelegationToken" fails
* "updateDelegationToken" fails consistently like a DB (at least I plan to
implement it that way).
If DTs are arriving on the TM side then a single
"UserGroupInformation.getCurrentUser.addCredentials"
will be called which I've never seen it failed.
* I hope all other code parts are not touching existing DTs within the JVM

I would like to emphasize I'm not against to add it just want to see what
kind of problems are we facing.
It would ease to catch bugs earlier and help in the maintenance.

All in all I would buy the idea to add the 3rd bullet if we foresee the
need.

G


On Fri, Jan 28, 2022 at 10:07 AM David Morávek  wrote:

> Hi Gabor,
>
> This is definitely headed in a right direction +1.
>
> I think we still need to have a safeguard in case some of the TMs gets into
> the inconsistent state though, which will also eliminate the need for
> implementing a custom retry mechanism (when _updateDelegationToken_ call
> fails for some reason).
>
> We already have this safeguard in place for slot pool (in case there are
> some slots in inconsistent state - eg. we haven't freed them for some
> reason) and for the partition tracker, which could be simply enhanced. This
> is done via periodic heartbeat from TaskManagers to the ResourceManager
> that contains report about state of these two components (from TM
> perspective) so the RM can reconcile their state if necessary.
>
> I don't think adding an additional field to _TaskExecutorHeartbeatPayload_
> should be a concern as we only heartbeat every ~ 10s by default and the new
> field would be small compared to rest of the existing payload. Also
> heartbeat doesn't need to contain the whole DT, but just some identifier
> which signals whether it uses the right one, that could be significantly
> smaller.
>
> This is still a PUSH based approach as the RM would again call the newly
> introduced _updateDelegationToken_ when it encounters inconsistency (eg.
> due to a temporary network partition / a race condition we didn't test for
> / some other scenario we didn't think about). In practice these
> inconsistencies are super hard to avoid and reason about (and unfortunately
> yes, we see them happen from time to time), so reusing the existing
> mechanism that is designed for this exact problem simplify things.
>
> To sum this up we'd have three code paths for calling
> _updateDelegationToken_:
> 1) When the TM registers, we push the token (if DTM already has it) to it
> 2) When DTM obtains a new token it broadcasts it to all currently connected
> TMs
> 3) When a TM gets out of sync, DTM would reconcile it's state
>
> WDYT?
>
> Best,
> D.
>
>
> On Wed, Jan 26, 2022 at 9:03 PM David Morávek  wrote:
>
> > Thanks the update, I'll go over it tomorrow.
> >
> > On Wed, Jan 26, 2022 at 5:33 PM Gabor Somogyi  >
> > wrote:
> >
> >> Hi All,
> >>
> >> Since it has turned out that DTM can't be added as member of JobMaster
> >> <
> >>
> https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
> >> >
> >> I've
> >> came up with a better proposal.
> >> David, thanks for pinpointing this out, you've caught a bug in the early
> >> phase!
> >>
> >> Namely ResourceManager
> >> <
> >>
> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124
> >> >
> >> is
> >> a single instance class where DTM can be added as member variable.
> >> It has a list of all already registered TMs and new TM registration is
> >> also
> >> happening here.
> >> The following can be added from logic perspective to be more specific:
> >> * Create new DTM instance in ResourceManager
> >> <
> >>
> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124
> >> >
> >> and
> >> start it (re-occurring thread to obtain new tokens)
> >> * Add a new function named "updateDelegationTokens" to
> TaskExecutorGateway
> >> <
> >>
>

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-28 Thread David Morávek

Hi Gabor,

This is definitely headed in a right direction +1.

I think we still need to have a safeguard in case some of the TMs gets into
the inconsistent state though, which will also eliminate the need for
implementing a custom retry mechanism (when _updateDelegationToken_ call
fails for some reason).

We already have this safeguard in place for slot pool (in case there are
some slots in inconsistent state - eg. we haven't freed them for some
reason) and for the partition tracker, which could be simply enhanced. This
is done via periodic heartbeat from TaskManagers to the ResourceManager
that contains report about state of these two components (from TM
perspective) so the RM can reconcile their state if necessary.

I don't think adding an additional field to _TaskExecutorHeartbeatPayload_
should be a concern as we only heartbeat every ~ 10s by default and the new
field would be small compared to rest of the existing payload. Also
heartbeat doesn't need to contain the whole DT, but just some identifier
which signals whether it uses the right one, that could be significantly
smaller.

This is still a PUSH based approach as the RM would again call the newly
introduced _updateDelegationToken_ when it encounters inconsistency (eg.
due to a temporary network partition / a race condition we didn't test for
/ some other scenario we didn't think about). In practice these
inconsistencies are super hard to avoid and reason about (and unfortunately
yes, we see them happen from time to time), so reusing the existing
mechanism that is designed for this exact problem simplify things.

To sum this up we'd have three code paths for calling
_updateDelegationToken_:
1) When the TM registers, we push the token (if DTM already has it) to it
2) When DTM obtains a new token it broadcasts it to all currently connected
TMs
3) When a TM gets out of sync, DTM would reconcile it's state

WDYT?

Best,
D.

On Wed, Jan 26, 2022 at 9:03 PM David Morávek  wrote:

> Thanks the update, I'll go over it tomorrow.
>
> On Wed, Jan 26, 2022 at 5:33 PM Gabor Somogyi 
> wrote:
>
>> Hi All,
>>
>> Since it has turned out that DTM can't be added as member of JobMaster
>> <
>> https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
>> >
>> I've
>> came up with a better proposal.
>> David, thanks for pinpointing this out, you've caught a bug in the early
>> phase!
>>
>> Namely ResourceManager
>> <
>> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124
>> >
>> is
>> a single instance class where DTM can be added as member variable.
>> It has a list of all already registered TMs and new TM registration is
>> also
>> happening here.
>> The following can be added from logic perspective to be more specific:
>> * Create new DTM instance in ResourceManager
>> <
>> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124
>> >
>> and
>> start it (re-occurring thread to obtain new tokens)
>> * Add a new function named "updateDelegationTokens" to TaskExecutorGateway
>> <
>> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutorGateway.java#L54
>> >
>> * Call "updateDelegationTokens" on all registered TMs to propagate new DTs
>> * In case of new TM registration call "updateDelegationTokens" before
>> registration succeeds to setup new TM properly
>>
>> This way:
>> * only a single DTM would live within a cluster which is the expected
>> behavior
>> * DTM is going to be added to a central place where all deployment target
>> can make use of it
>> * DTs are going to be pushed to TMs which would generate less network
>> traffic than pull based approach
>> (please see my previous mail where I've described both approaches)
>> * HA scenario is going to be consistent because such
>> <
>> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java#L1069
>> >
>> a solution can be added to "updateDelegationTokens"
>>
>> @David or all others plz share whether you agree on this or you have
>> better
>> idea/suggestion.
>>
>> BR,
>> G
>>
>>
>> On Tue, Jan 25, 2022 at 11:00 AM Gabor Somogyi > >
>> wrote:
>>
>> > First of all thanks for investing your time and helping me out. As I see
>> > you have pretty solid knowledge in the RPC area.
>> > I would like to rely on your knowledge since I'm learning this part.
>> >
>> > > - Do we need to introduce a new RPC method or can we for example
>> > piggyback
>> > on heartbeats?
>> >
>> > I'm fine with either solution but one thing is important conceptually.
>> > There are fundamentally 2

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-26 Thread David Morávek

Thanks the update, I'll go over it tomorrow.

On Wed, Jan 26, 2022 at 5:33 PM Gabor Somogyi 
wrote:

> Hi All,
>
> Since it has turned out that DTM can't be added as member of JobMaster
> <
> https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
> >
> I've
> came up with a better proposal.
> David, thanks for pinpointing this out, you've caught a bug in the early
> phase!
>
> Namely ResourceManager
> <
> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124
> >
> is
> a single instance class where DTM can be added as member variable.
> It has a list of all already registered TMs and new TM registration is also
> happening here.
> The following can be added from logic perspective to be more specific:
> * Create new DTM instance in ResourceManager
> <
> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124
> >
> and
> start it (re-occurring thread to obtain new tokens)
> * Add a new function named "updateDelegationTokens" to TaskExecutorGateway
> <
> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutorGateway.java#L54
> >
> * Call "updateDelegationTokens" on all registered TMs to propagate new DTs
> * In case of new TM registration call "updateDelegationTokens" before
> registration succeeds to setup new TM properly
>
> This way:
> * only a single DTM would live within a cluster which is the expected
> behavior
> * DTM is going to be added to a central place where all deployment target
> can make use of it
> * DTs are going to be pushed to TMs which would generate less network
> traffic than pull based approach
> (please see my previous mail where I've described both approaches)
> * HA scenario is going to be consistent because such
> <
> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java#L1069
> >
> a solution can be added to "updateDelegationTokens"
>
> @David or all others plz share whether you agree on this or you have better
> idea/suggestion.
>
> BR,
> G
>
>
> On Tue, Jan 25, 2022 at 11:00 AM Gabor Somogyi 
> wrote:
>
> > First of all thanks for investing your time and helping me out. As I see
> > you have pretty solid knowledge in the RPC area.
> > I would like to rely on your knowledge since I'm learning this part.
> >
> > > - Do we need to introduce a new RPC method or can we for example
> > piggyback
> > on heartbeats?
> >
> > I'm fine with either solution but one thing is important conceptually.
> > There are fundamentally 2 ways how tokens can be updated:
> > - Push way: When there are new DTs then JM JVM pushes DTs to TM JVMs.
> This
> > is the preferred one since tiny amount of control logic needed.
> > - Pull way: Each time a TM would like to poll JM whether there are new
> > tokens and each TM wants to decide alone whether DTs needs to be updated
> or
> > not.
> > As you've mentioned here some ID needs to be generated, it would
> generated
> > quite some additional network traffic which can be definitely avoided.
> > As a final thought in Spark we've had this way of DT propagation logic
> and
> > we've had major issues with it.
> >
> > So all in all DTM needs to obtain new tokens and there must a way to send
> > this data to all TMs from JM.
> >
> > > - What delivery semantics are we looking for? (what if we're only able
> to
> > update subset of TMs / what happens if we exhaust retries / should we
> even
> > have the retry mechanism whatsoever) - I have a feeling that somehow
> > leveraging the existing heartbeat mechanism could help to answer these
> > questions
> >
> > Let's go through these questions one by one.
> > > What delivery semantics are we looking for?
> >
> > DTM must receive an exception when at least one TM was not able to get
> DTs.
> >
> > > what if we're only able to update subset of TMs?
> >
> > Such case DTM will reschedule token obtain after
> > "security.kerberos.tokens.retry-wait" time.
> >
> > > what happens if we exhaust retries?
> >
> > There is no number of retries. In default configuration tokens needs to
> be
> > re-obtained after one day.
> > DTM tries to obtain new tokens after 1day * 0.75
> > (security.kerberos.tokens.renewal-ratio) = 18 hours.
> > When fails it retries after "security.kerberos.tokens.retry-wait" which
> is
> > 1 hour by default.
> > If it never succeeds then authentication error is going to happen on the
> > TM side and the workload is
> > going to stop.
> >
> > > should we even have the retry mechanism whatsoever?
> >
> > Yes, because there are always temporary cluster issues.
> >
> >

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-26 Thread Gabor Somogyi

Hi All,

Since it has turned out that DTM can't be added as member of JobMaster

I've
came up with a better proposal.
David, thanks for pinpointing this out, you've caught a bug in the early
phase!

Namely ResourceManager

is
a single instance class where DTM can be added as member variable.
It has a list of all already registered TMs and new TM registration is also
happening here.
The following can be added from logic perspective to be more specific:
* Create new DTM instance in ResourceManager

and
start it (re-occurring thread to obtain new tokens)
* Add a new function named "updateDelegationTokens" to TaskExecutorGateway

* Call "updateDelegationTokens" on all registered TMs to propagate new DTs
* In case of new TM registration call "updateDelegationTokens" before
registration succeeds to setup new TM properly

This way:
* only a single DTM would live within a cluster which is the expected
behavior
* DTM is going to be added to a central place where all deployment target
can make use of it
* DTs are going to be pushed to TMs which would generate less network
traffic than pull based approach
(please see my previous mail where I've described both approaches)
* HA scenario is going to be consistent because such

a solution can be added to "updateDelegationTokens"

@David or all others plz share whether you agree on this or you have better
idea/suggestion.

BR,
G


On Tue, Jan 25, 2022 at 11:00 AM Gabor Somogyi 
wrote:

> First of all thanks for investing your time and helping me out. As I see
> you have pretty solid knowledge in the RPC area.
> I would like to rely on your knowledge since I'm learning this part.
>
> > - Do we need to introduce a new RPC method or can we for example
> piggyback
> on heartbeats?
>
> I'm fine with either solution but one thing is important conceptually.
> There are fundamentally 2 ways how tokens can be updated:
> - Push way: When there are new DTs then JM JVM pushes DTs to TM JVMs. This
> is the preferred one since tiny amount of control logic needed.
> - Pull way: Each time a TM would like to poll JM whether there are new
> tokens and each TM wants to decide alone whether DTs needs to be updated or
> not.
> As you've mentioned here some ID needs to be generated, it would generated
> quite some additional network traffic which can be definitely avoided.
> As a final thought in Spark we've had this way of DT propagation logic and
> we've had major issues with it.
>
> So all in all DTM needs to obtain new tokens and there must a way to send
> this data to all TMs from JM.
>
> > - What delivery semantics are we looking for? (what if we're only able to
> update subset of TMs / what happens if we exhaust retries / should we even
> have the retry mechanism whatsoever) - I have a feeling that somehow
> leveraging the existing heartbeat mechanism could help to answer these
> questions
>
> Let's go through these questions one by one.
> > What delivery semantics are we looking for?
>
> DTM must receive an exception when at least one TM was not able to get DTs.
>
> > what if we're only able to update subset of TMs?
>
> Such case DTM will reschedule token obtain after
> "security.kerberos.tokens.retry-wait" time.
>
> > what happens if we exhaust retries?
>
> There is no number of retries. In default configuration tokens needs to be
> re-obtained after one day.
> DTM tries to obtain new tokens after 1day * 0.75
> (security.kerberos.tokens.renewal-ratio) = 18 hours.
> When fails it retries after "security.kerberos.tokens.retry-wait" which is
> 1 hour by default.
> If it never succeeds then authentication error is going to happen on the
> TM side and the workload is
> going to stop.
>
> > should we even have the retry mechanism whatsoever?
>
> Yes, because there are always temporary cluster issues.
>
> > What does it mean for the running application (how does this look like
> from
> the user perspective)? As far as I remember the logs are only collected
> ("aggregated") after the container is stopped, is that correct?
>
> With default config it works like that but it can be forced to aggregate
> at specific intervals.
> A useful feature is

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-25 Thread Gabor Somogyi

First of all thanks for investing your time and helping me out. As I see
you have pretty solid knowledge in the RPC area.
I would like to rely on your knowledge since I'm learning this part.

> - Do we need to introduce a new RPC method or can we for example piggyback
on heartbeats?

I'm fine with either solution but one thing is important conceptually.
There are fundamentally 2 ways how tokens can be updated:
- Push way: When there are new DTs then JM JVM pushes DTs to TM JVMs. This
is the preferred one since tiny amount of control logic needed.
- Pull way: Each time a TM would like to poll JM whether there are new
tokens and each TM wants to decide alone whether DTs needs to be updated or
not.
As you've mentioned here some ID needs to be generated, it would generated
quite some additional network traffic which can be definitely avoided.
As a final thought in Spark we've had this way of DT propagation logic and
we've had major issues with it.

So all in all DTM needs to obtain new tokens and there must a way to send
this data to all TMs from JM.

> - What delivery semantics are we looking for? (what if we're only able to
update subset of TMs / what happens if we exhaust retries / should we even
have the retry mechanism whatsoever) - I have a feeling that somehow
leveraging the existing heartbeat mechanism could help to answer these
questions

Let's go through these questions one by one.
> What delivery semantics are we looking for?

DTM must receive an exception when at least one TM was not able to get DTs.

> what if we're only able to update subset of TMs?

Such case DTM will reschedule token obtain after
"security.kerberos.tokens.retry-wait" time.

> what happens if we exhaust retries?

There is no number of retries. In default configuration tokens needs to be
re-obtained after one day.
DTM tries to obtain new tokens after 1day * 0.75
(security.kerberos.tokens.renewal-ratio) = 18 hours.
When fails it retries after "security.kerberos.tokens.retry-wait" which is
1 hour by default.
If it never succeeds then authentication error is going to happen on the TM
side and the workload is
going to stop.

> should we even have the retry mechanism whatsoever?

Yes, because there are always temporary cluster issues.

> What does it mean for the running application (how does this look like
from
the user perspective)? As far as I remember the logs are only collected
("aggregated") after the container is stopped, is that correct?

With default config it works like that but it can be forced to aggregate at
specific intervals.
A useful feature is forcing YARN to aggregate logs while the job is still
running.
For long-running jobs such as streaming jobs, this is invaluable. To do
this,
yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds must be
set to a non-negative value.
When this is set, a timer will be set for the given duration, and whenever
that timer goes off,
log aggregation will run on new files.

> I think
this topic should get its own section in the FLIP (having some cross
reference to YARN ticket would be really useful, but I'm not sure if there
are any).

I think this is important knowledge but this FLIP is not touching the
already existing behavior.
DTs are set on the AM container which is renewed by YARN until it's not
possible anymore.
Any kind of new code is not going to change this limitation. BTW, there is
no jira for this.
If you think it worth to write this down then I think the good place is the
official security doc
area as caveat.

> If we split the FLIP into two parts / sections that I've suggested, I
don't
really think that you need to explicitly test for each deployment scenario
/ cluster framework, because the DTM part is completely independent of the
deployment target. Basically this is what I'm aiming for with "making it
work with the standalone" (as simple as starting a new java process) Flink
first (which is also how most people deploy streaming application on k8s
and the direction we're pushing forward with the auto-scaling / reactive
mode initiatives).

I see your point and agree the main direction. k8s is the megatrend which
most of the peoples
will use sooner or later. Not 100% sure what kind of split you suggest but
in my view
the main target is to add this feature and I'm open to any logical work
ordering.
Please share the specific details and we work it out...

G

On Mon, Jan 24, 2022 at 3:04 PM David Morávek  wrote:

> >
> > Could you point to a code where you think it could be added exactly? A
> > helping hand is welcome here 
> >
>
> I think you can take a look at _ResourceManagerPartitionTracker_ [1] which
> seems to have somewhat similar properties to the DTM.
>
> One topic that needs to be addressed there is how the RPC with the
> _TaskExecutorGateway_ should look like.
> - Do we need to introduce a new RPC method or can we for example piggyback
> on heartbeats?
> - What delivery semantics are we looking for? (what if we're only able to
> update subset of TMs /

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-24 Thread David Morávek

>
> Do we need to introduce a new RPC method or can we for example piggyback
> on heartbeats?
>

Seems we can use the very same approach as
_ResourceManagerPartitionTracker_ is using:
- _TaskManagers_ periodically report which token they're using (eg.
identified by some id). This involves adding a new field into
_TaskExecutorHeartbeatPayload_.
- Once report arrives, DTM checks the token and updates it if necessary
(we'd introduce a new method for that on TaskExecutorGateway).
- If update fails, we don't need to retry. The next heartbeat takes care of
that.
- Heartbeat mechanism already covers TM failure scenarios

On Mon, Jan 24, 2022 at 3:03 PM David Morávek  wrote:

> Could you point to a code where you think it could be added exactly? A
>> helping hand is welcome here 
>>
>
> I think you can take a look at _ResourceManagerPartitionTracker_ [1] which
> seems to have somewhat similar properties to the DTM.
>
> One topic that needs to be addressed there is how the RPC with the
> _TaskExecutorGateway_ should look like.
> - Do we need to introduce a new RPC method or can we for example piggyback
> on heartbeats?
> - What delivery semantics are we looking for? (what if we're only able to
> update subset of TMs / what happens if we exhaust retries / should we even
> have the retry mechanism whatsoever) - I have a feeling that somehow
> leveraging the existing heartbeat mechanism could help to answer these
> questions
>
> In short, after DT reaches it's max lifetime then log aggregation stops
>>
>
> What does it mean for the running application (how does this look like
> from the user perspective)? As far as I remember the logs are only
> collected ("aggregated") after the container is stopped, is that correct? I
> think this topic should get its own section in the FLIP (having some cross
> reference to YARN ticket would be really useful, but I'm not sure if there
> are any).
>
> All deployment modes (per-job, per-app, ...) are planned to be tested and
>> expect to work with the initial implementation however not all deployment
>> targets (k8s, local, ...
>>
>
> If we split the FLIP into two parts / sections that I've suggested, I
> don't really think that you need to explicitly test for each deployment
> scenario / cluster framework, because the DTM part is completely
> independent of the deployment target. Basically this is what I'm aiming for
> with "making it work with the standalone" (as simple as starting a new java
> process) Flink first (which is also how most people deploy streaming
> application on k8s and the direction we're pushing forward with the
> auto-scaling / reactive mode initiatives).
>
> The whole integration with YARN (let's forget about log aggregation for a
> moment) / k8s-native only boils down to how do we make the keytab file
> local to the JobManager so the DTM can read it, so it's basically built on
> top of that. The only special thing that needs to be tested there is the
> "keytab distribution" code path.
>
> [1]
> https://github.com/apache/flink/blob/release-1.14.3/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResourceManagerPartitionTracker.java
>
> Best,
> D.
>
> On Mon, Jan 24, 2022 at 12:35 PM Gabor Somogyi 
> wrote:
>
>> > There is a separate JobMaster for each job
>> within a Flink cluster and each JobMaster only has a partial view of the
>> task managers
>>
>> Good point! I've had a deeper look and you're right. We definitely need to
>> find another place.
>>
>> > Related per-cluster or per-job keytab:
>>
>> In the current code per-cluster keytab is implemented and I'm intended to
>> keep it like this within this FLIP. The reason is simple: tokens on TM
>> side
>> can be stored within the UserGroupInformation (UGI) structure which is
>> global. I'm not telling it's impossible to change that but I think that
>> this is such a complexity which the initial implementation is not required
>> to contain. Additionally we've not seen such need from user side. If the
>> need may rise later on then another FLIP with this topic can be created
>> and
>> discussed. Proper multi-UGI handling within a single JVM is a topic where
>> several round of deep-dive with the Hadoop/YARN guys are required.
>>
>> > single DTM instance embedded with
>> the ResourceManager (the Flink component)
>>
>> Could you point to a code where you think it could be added exactly? A
>> helping hand is welcome here
>>
>> > Then the single (initial) implementation should work with all the
>> deployments modes out of the box (which is not what the FLIP suggests). Is
>> that correct?
>>
>> All deployment modes (per-job, per-app, ...) are planned to be tested and
>> expect to work with the initial implementation however not all deployment
>> targets (k8s, local, ...) are not intended to be tested. Per deployment
>> target new jira needs to be created where I expect small number of codes
>> needs to be added and relatively expensive testing effort is required.
>>
>> > I've taken a look

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-24 Thread David Morávek

>
> Could you point to a code where you think it could be added exactly? A
> helping hand is welcome here 
>

I think you can take a look at _ResourceManagerPartitionTracker_ [1] which
seems to have somewhat similar properties to the DTM.

One topic that needs to be addressed there is how the RPC with the
_TaskExecutorGateway_ should look like.
- Do we need to introduce a new RPC method or can we for example piggyback
on heartbeats?
- What delivery semantics are we looking for? (what if we're only able to
update subset of TMs / what happens if we exhaust retries / should we even
have the retry mechanism whatsoever) - I have a feeling that somehow
leveraging the existing heartbeat mechanism could help to answer these
questions

In short, after DT reaches it's max lifetime then log aggregation stops
>

What does it mean for the running application (how does this look like from
the user perspective)? As far as I remember the logs are only collected
("aggregated") after the container is stopped, is that correct? I think
this topic should get its own section in the FLIP (having some cross
reference to YARN ticket would be really useful, but I'm not sure if there
are any).

All deployment modes (per-job, per-app, ...) are planned to be tested and
> expect to work with the initial implementation however not all deployment
> targets (k8s, local, ...
>

If we split the FLIP into two parts / sections that I've suggested, I don't
really think that you need to explicitly test for each deployment scenario
/ cluster framework, because the DTM part is completely independent of the
deployment target. Basically this is what I'm aiming for with "making it
work with the standalone" (as simple as starting a new java process) Flink
first (which is also how most people deploy streaming application on k8s
and the direction we're pushing forward with the auto-scaling / reactive
mode initiatives).

The whole integration with YARN (let's forget about log aggregation for a
moment) / k8s-native only boils down to how do we make the keytab file
local to the JobManager so the DTM can read it, so it's basically built on
top of that. The only special thing that needs to be tested there is the
"keytab distribution" code path.

[1]
https://github.com/apache/flink/blob/release-1.14.3/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResourceManagerPartitionTracker.java

Best,
D.

On Mon, Jan 24, 2022 at 12:35 PM Gabor Somogyi 
wrote:

> > There is a separate JobMaster for each job
> within a Flink cluster and each JobMaster only has a partial view of the
> task managers
>
> Good point! I've had a deeper look and you're right. We definitely need to
> find another place.
>
> > Related per-cluster or per-job keytab:
>
> In the current code per-cluster keytab is implemented and I'm intended to
> keep it like this within this FLIP. The reason is simple: tokens on TM side
> can be stored within the UserGroupInformation (UGI) structure which is
> global. I'm not telling it's impossible to change that but I think that
> this is such a complexity which the initial implementation is not required
> to contain. Additionally we've not seen such need from user side. If the
> need may rise later on then another FLIP with this topic can be created and
> discussed. Proper multi-UGI handling within a single JVM is a topic where
> several round of deep-dive with the Hadoop/YARN guys are required.
>
> > single DTM instance embedded with
> the ResourceManager (the Flink component)
>
> Could you point to a code where you think it could be added exactly? A
> helping hand is welcome here
>
> > Then the single (initial) implementation should work with all the
> deployments modes out of the box (which is not what the FLIP suggests). Is
> that correct?
>
> All deployment modes (per-job, per-app, ...) are planned to be tested and
> expect to work with the initial implementation however not all deployment
> targets (k8s, local, ...) are not intended to be tested. Per deployment
> target new jira needs to be created where I expect small number of codes
> needs to be added and relatively expensive testing effort is required.
>
> > I've taken a look into the prototype and in the "YarnClusterDescriptor"
> you're injecting a delegation token into the AM [1] (that's obtained using
> the provided keytab). If I understand this correctly from previous
> discussion / FLIP, this is to support log aggregation and DT has a limited
> validity. How is this DT going to be renewed?
>
> You're clever and touched a limitation which Spark has too. In short, after
> DT reaches it's max lifetime then log aggregation stops. I've had several
> deep-dive rounds with the YARN guys at Spark years because wanted to fill
> this gap. They can't provide us any way to re-inject the newly obtained DT
> so at the end I gave up this.
>
> BR,
> G
>
>
> On Mon, 24 Jan 2022, 11:00 David Morávek,  wrote:
>
> > Hi Gabor,
> >
> > There is actually a huge difference between

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-24 Thread Gabor Somogyi

> There is a separate JobMaster for each job
within a Flink cluster and each JobMaster only has a partial view of the
task managers

Good point! I've had a deeper look and you're right. We definitely need to
find another place.

> Related per-cluster or per-job keytab:

In the current code per-cluster keytab is implemented and I'm intended to
keep it like this within this FLIP. The reason is simple: tokens on TM side
can be stored within the UserGroupInformation (UGI) structure which is
global. I'm not telling it's impossible to change that but I think that
this is such a complexity which the initial implementation is not required
to contain. Additionally we've not seen such need from user side. If the
need may rise later on then another FLIP with this topic can be created and
discussed. Proper multi-UGI handling within a single JVM is a topic where
several round of deep-dive with the Hadoop/YARN guys are required.

> single DTM instance embedded with
the ResourceManager (the Flink component)

Could you point to a code where you think it could be added exactly? A
helping hand is welcome here

> Then the single (initial) implementation should work with all the
deployments modes out of the box (which is not what the FLIP suggests). Is
that correct?

All deployment modes (per-job, per-app, ...) are planned to be tested and
expect to work with the initial implementation however not all deployment
targets (k8s, local, ...) are not intended to be tested. Per deployment
target new jira needs to be created where I expect small number of codes
needs to be added and relatively expensive testing effort is required.

> I've taken a look into the prototype and in the "YarnClusterDescriptor"
you're injecting a delegation token into the AM [1] (that's obtained using
the provided keytab). If I understand this correctly from previous
discussion / FLIP, this is to support log aggregation and DT has a limited
validity. How is this DT going to be renewed?

You're clever and touched a limitation which Spark has too. In short, after
DT reaches it's max lifetime then log aggregation stops. I've had several
deep-dive rounds with the YARN guys at Spark years because wanted to fill
this gap. They can't provide us any way to re-inject the newly obtained DT
so at the end I gave up this.

BR,
G

On Mon, 24 Jan 2022, 11:00 David Morávek,  wrote:

> Hi Gabor,
>
> There is actually a huge difference between JobManager (process) and
> JobMaster (job coordinator). The naming is unfortunately bit misleading
> here from historical reasons. There is a separate JobMaster for each job
> within a Flink cluster and each JobMaster only has a partial view of the
> task managers (depends on where the slots for a particular job are
> allocated). This means that you'll end up with N "DelegationTokenManagers"
> competing with each other (N = number of running jobs in the cluster).
>
> This makes me think we're mixing two abstraction levels here:
>
> a) Per-cluster delegation tokens
> - Simpler approach, it would involve a single DTM instance embedded with
> the ResourceManager (the Flink component)
> b) Per-job delegation tokens
> - More complex approach, but could be more flexible from the user side of
> things.
> - Multiple DTM instances, that are bound with the JobMaster lifecycle.
> Delegation tokens are attached with a particular slots that are executing
> the job tasks instead of the whole task manager (TM could be executing
> multiple jobs with different tokens).
> - The question is which keytab should be used for the clustering framework,
> to support log aggregation on YARN (an extra keytab, keytab that comes with
> the first job?)
>
> I think these are the things that need to be clarified in the FLIP before
> proceeding.
>
> A follow-up question for getting a better understanding where this should
> be headed: Are there any use cases where user may want to use different
> keytabs with each job, or are we fine with using a cluster-wide keytab? If
> we go with per-cluster keytabs, is it OK that all jobs submitted into this
> cluster can access it (even the future ones)? Should this be a security
> concern?
>
> Presume you though I would implement a new class with JobManager name. The
> > plan is not that.
> >
>
> I've never suggested such thing.
>
>
> > No. That said earlier DT handling is planned to be done completely in
> > Flink. DTM has a renewal thread which re-obtains tokens in the proper
> time
> > when needed.
> >
>
> Then the single (initial) implementation should work with all the
> deployments modes out of the box (which is not what the FLIP suggests). Is
> that correct?
>
> If the cluster framework, also requires delegation token for their inner
> working (this is IMO only applies to YARN), it might need an extra step
> (injecting the token into application master container).
>
> Separating the individual layers (actual Flink cluster - basically making
> this work with a standalone deployment  / "cluster framework" - support for

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-24 Thread David Morávek

Hi Gabor,

There is actually a huge difference between JobManager (process) and
JobMaster (job coordinator). The naming is unfortunately bit misleading
here from historical reasons. There is a separate JobMaster for each job
within a Flink cluster and each JobMaster only has a partial view of the
task managers (depends on where the slots for a particular job are
allocated). This means that you'll end up with N "DelegationTokenManagers"
competing with each other (N = number of running jobs in the cluster).

This makes me think we're mixing two abstraction levels here:

a) Per-cluster delegation tokens
- Simpler approach, it would involve a single DTM instance embedded with
the ResourceManager (the Flink component)
b) Per-job delegation tokens
- More complex approach, but could be more flexible from the user side of
things.
- Multiple DTM instances, that are bound with the JobMaster lifecycle.
Delegation tokens are attached with a particular slots that are executing
the job tasks instead of the whole task manager (TM could be executing
multiple jobs with different tokens).
- The question is which keytab should be used for the clustering framework,
to support log aggregation on YARN (an extra keytab, keytab that comes with
the first job?)

I think these are the things that need to be clarified in the FLIP before
proceeding.

A follow-up question for getting a better understanding where this should
be headed: Are there any use cases where user may want to use different
keytabs with each job, or are we fine with using a cluster-wide keytab? If
we go with per-cluster keytabs, is it OK that all jobs submitted into this
cluster can access it (even the future ones)? Should this be a security
concern?

Presume you though I would implement a new class with JobManager name. The
> plan is not that.
>

I've never suggested such thing.

> No. That said earlier DT handling is planned to be done completely in
> Flink. DTM has a renewal thread which re-obtains tokens in the proper time
> when needed.
>

Then the single (initial) implementation should work with all the
deployments modes out of the box (which is not what the FLIP suggests). Is
that correct?

If the cluster framework, also requires delegation token for their inner
working (this is IMO only applies to YARN), it might need an extra step
(injecting the token into application master container).

Separating the individual layers (actual Flink cluster - basically making
this work with a standalone deployment  / "cluster framework" - support for
YARN log aggregation) in the FLIP would be useful.

Reading the linked Spark readme could be useful.
>

I've read that, but please be patient with the questions, Kerberos is not
an easy topic to get into and I've had a very little contact with it in the
past.

https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
>

I've taken a look into the prototype and in the "YarnClusterDescriptor"
you're injecting a delegation token into the AM [1] (that's obtained using
the provided keytab). If I understand this correctly from previous
discussion / FLIP, this is to support log aggregation and DT has a limited
validity. How is this DT going to be renewed?

[1]
https://github.com/gaborgsomogyi/flink/commit/8ab75e46013f159778ccfce52463e7bc63e395a9#diff-02416e2d6ca99e1456f9c3949f3d7c2ac523d3fe25378620c09632e4aac34e4eR1261

Best,
D.

On Fri, Jan 21, 2022 at 9:35 PM Gabor Somogyi 
wrote:

> Here is the exact class, I'm from mobile so not had a look at the exact
> class name:
>
> https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
> That keeps track of TMs where the tokens can be sent to.
>
> > My feeling would be that we shouldn't really introduce a new component
> with
> a custom lifecycle, but rather we should try to incorporate this into
> existing ones.
>
> Can you be more specific? Presume you though I would implement a new class
> with JobManager name. The plan is not that.
>
> > If I understand this correctly, this means that we then push the token
> renewal logic to YARN.
>
> No. That said earlier DT handling is planned to be done completely in
> Flink. DTM has a renewal thread which re-obtains tokens in the proper time
> when needed. YARN log aggregation is a totally different feature, where
> YARN does the renewal. Log aggregation was an example why the code can't be
> 100% reusable for all resource managers. Reading the linked Spark readme
> could be useful.
>
> G
>
> On Fri, 21 Jan 2022, 21:05 David Morávek,  wrote:
>
> > >
> > > JobManager is the Flink class.
> >
> >
> > There is no such class in Flink. The closest thing to the JobManager is a
> > ClusterEntrypoint. The cluster entrypoint spawns new RM Runner &
> Dispatcher
> > Runner that start participating in the leader election. Once they gain
> >

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

Here is the exact class, I'm from mobile so not had a look at the exact
class name:
https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
That keeps track of TMs where the tokens can be sent to.

> My feeling would be that we shouldn't really introduce a new component
with
a custom lifecycle, but rather we should try to incorporate this into
existing ones.

Can you be more specific? Presume you though I would implement a new class
with JobManager name. The plan is not that.

> If I understand this correctly, this means that we then push the token
renewal logic to YARN.

No. That said earlier DT handling is planned to be done completely in
Flink. DTM has a renewal thread which re-obtains tokens in the proper time
when needed. YARN log aggregation is a totally different feature, where
YARN does the renewal. Log aggregation was an example why the code can't be
100% reusable for all resource managers. Reading the linked Spark readme
could be useful.

G

On Fri, 21 Jan 2022, 21:05 David Morávek,  wrote:

> >
> > JobManager is the Flink class.
>
>
> There is no such class in Flink. The closest thing to the JobManager is a
> ClusterEntrypoint. The cluster entrypoint spawns new RM Runner & Dispatcher
> Runner that start participating in the leader election. Once they gain
> leadership they spawn the actual underlying instances of these two "main
> components".
>
> My feeling would be that we shouldn't really introduce a new component with
> a custom lifecycle, but rather we should try to incorporate this into
> existing ones.
>
> My biggest concerns would be:
>
> - How would the lifecycle of the new component look like with regards to HA
> setups. If we really try to decide to introduce a completely new component,
> how should this work in case of multiple JobManager instances?
> - Which components does it talk to / how? For example how does the
> broadcast of new token to task managers (TaskManagerGateway) look like? Do
> we simply introduce a new RPC on the ResourceManagerGateway that broadcasts
> it or does the new component need to do some kind of bookkeeping of task
> managers that it needs to notify?
>
> YARN based HDFS log aggregation would not work by dropping that code. Just
> > to be crystal clear, the actual implementation contains this fir exactly
> > this reason.
> >
>
> This is the missing part +1. If I understand this correctly, this means
> that we then push the token renewal logic to YARN. How do you plan to
> implement the renewal logic on k8s?
>
> D.
>
> On Fri, Jan 21, 2022 at 8:37 PM Gabor Somogyi 
> wrote:
>
> > > I think we might both mean something different by the RM.
> >
> > You feel it well, I've not specified these terms well in the explanation.
> > RM I meant resource management framework. JobManager is the Flink class.
> > This means that inside JM instance there will be a DTM instance, so they
> > would have the same lifecycle. Hope I've answered the question.
> >
> > > If we have tokens available on the client side, why do we need to set
> > them
> > into the AM (yarn specific concept) launch context?
> >
> > YARN based HDFS log aggregation would not work by dropping that code.
> Just
> > to be crystal clear, the actual implementation contains this fir exactly
> > this reason.
> >
> > G
> >
> > On Fri, 21 Jan 2022, 20:12 David Morávek,  wrote:
> >
> > > Hi Gabor,
> > >
> > > 1. One thing is important, token management is planned to be done
> > > > generically within Flink and not scattered in RM specific code.
> > > JobManager
> > > > has a DelegationTokenManager which obtains tokens time-to-time (if
> > > > configured properly). JM knows which TaskManagers are in place so it
> > can
> > > > distribute it to all TMs. That's it basically.
> > >
> > >
> > > I think we might both mean something different by the RM. JobManager is
> > > basically just a process encapsulating multiple components, one of
> which
> > is
> > > a ResourceManager, which is the component that manages task manager
> > > registrations [1]. There is more or less a single implementation of the
> > RM
> > > with plugable drivers for the active integrations (yarn, k8s).
> > >
> > > It would be great if you could share more details of how exactly the
> DTM
> > is
> > > going to fit in the current JM architecture.
> > >
> > > 2. 99.9% of the code is generic but each RM handles tokens
> differently. A
> > > > good example is YARN obtains tokens on client side and then sets them
> > on
> > > > the newly created AM container launch context. This is purely YARN
> > > specific
> > > > and cant't be spared. With my actual plans standalone can be changed
> to
> > > use
> > > > the framework. By using it I mean no RM specific DTM or whatsoever is
> > > > needed.
> > > >
> > >
> > > If we have tokens available on the client side, why do we need to set
> > them
> > > into the AM (yarn specific concept) launch

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-21 Thread David Morávek

>
> JobManager is the Flink class.


There is no such class in Flink. The closest thing to the JobManager is a
ClusterEntrypoint. The cluster entrypoint spawns new RM Runner & Dispatcher
Runner that start participating in the leader election. Once they gain
leadership they spawn the actual underlying instances of these two "main
components".

My feeling would be that we shouldn't really introduce a new component with
a custom lifecycle, but rather we should try to incorporate this into
existing ones.

My biggest concerns would be:

- How would the lifecycle of the new component look like with regards to HA
setups. If we really try to decide to introduce a completely new component,
how should this work in case of multiple JobManager instances?
- Which components does it talk to / how? For example how does the
broadcast of new token to task managers (TaskManagerGateway) look like? Do
we simply introduce a new RPC on the ResourceManagerGateway that broadcasts
it or does the new component need to do some kind of bookkeeping of task
managers that it needs to notify?

YARN based HDFS log aggregation would not work by dropping that code. Just
> to be crystal clear, the actual implementation contains this fir exactly
> this reason.
>

This is the missing part +1. If I understand this correctly, this means
that we then push the token renewal logic to YARN. How do you plan to
implement the renewal logic on k8s?

D.

On Fri, Jan 21, 2022 at 8:37 PM Gabor Somogyi 
wrote:

> > I think we might both mean something different by the RM.
>
> You feel it well, I've not specified these terms well in the explanation.
> RM I meant resource management framework. JobManager is the Flink class.
> This means that inside JM instance there will be a DTM instance, so they
> would have the same lifecycle. Hope I've answered the question.
>
> > If we have tokens available on the client side, why do we need to set
> them
> into the AM (yarn specific concept) launch context?
>
> YARN based HDFS log aggregation would not work by dropping that code. Just
> to be crystal clear, the actual implementation contains this fir exactly
> this reason.
>
> G
>
> On Fri, 21 Jan 2022, 20:12 David Morávek,  wrote:
>
> > Hi Gabor,
> >
> > 1. One thing is important, token management is planned to be done
> > > generically within Flink and not scattered in RM specific code.
> > JobManager
> > > has a DelegationTokenManager which obtains tokens time-to-time (if
> > > configured properly). JM knows which TaskManagers are in place so it
> can
> > > distribute it to all TMs. That's it basically.
> >
> >
> > I think we might both mean something different by the RM. JobManager is
> > basically just a process encapsulating multiple components, one of which
> is
> > a ResourceManager, which is the component that manages task manager
> > registrations [1]. There is more or less a single implementation of the
> RM
> > with plugable drivers for the active integrations (yarn, k8s).
> >
> > It would be great if you could share more details of how exactly the DTM
> is
> > going to fit in the current JM architecture.
> >
> > 2. 99.9% of the code is generic but each RM handles tokens differently. A
> > > good example is YARN obtains tokens on client side and then sets them
> on
> > > the newly created AM container launch context. This is purely YARN
> > specific
> > > and cant't be spared. With my actual plans standalone can be changed to
> > use
> > > the framework. By using it I mean no RM specific DTM or whatsoever is
> > > needed.
> > >
> >
> > If we have tokens available on the client side, why do we need to set
> them
> > into the AM (yarn specific concept) launch context? Why can't we simply
> > send them to the JM, eg. as a parameter of the job submission / via
> > separate RPC call? There might be something I'm missing due to limited
> > knowledge, but handling the token on the "cluster framework" level
> doesn't
> > seem necessary.
> >
> > [1]
> >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/#jobmanager
> >
> > Best,
> > D.
> >
> > On Fri, Jan 21, 2022 at 7:48 PM Gabor Somogyi  >
> > wrote:
> >
> > > Oh and one more thing. I'm planning to add this feature in small chunk
> of
> > > PRs because security is super hairy area. That way reviewers can be
> more
> > > easily obtains the concept.
> > >
> > > On Fri, 21 Jan 2022, 18:03 David Morávek,  wrote:
> > >
> > > > Hi Gabor,
> > > >
> > > > thanks for drafting the FLIP, I think having a solid Kerberos support
> > is
> > > > crucial for many enterprise deployments.
> > > >
> > > > I have multiple questions regarding the implementation (note that I
> > have
> > > > very limited knowledge of Kerberos):
> > > >
> > > > 1) If I understand it correctly, we'll only obtain tokens in the job
> > > > manager and then we'll distribute them via RPC (needs to be secured).
> > > >
> > > > Can you please outline how the communication will look like? Is the
> > > >

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

> I think we might both mean something different by the RM.

You feel it well, I've not specified these terms well in the explanation.
RM I meant resource management framework. JobManager is the Flink class.
This means that inside JM instance there will be a DTM instance, so they
would have the same lifecycle. Hope I've answered the question.

> If we have tokens available on the client side, why do we need to set them
into the AM (yarn specific concept) launch context?

YARN based HDFS log aggregation would not work by dropping that code. Just
to be crystal clear, the actual implementation contains this fir exactly
this reason.

G

On Fri, 21 Jan 2022, 20:12 David Morávek,  wrote:

> Hi Gabor,
>
> 1. One thing is important, token management is planned to be done
> > generically within Flink and not scattered in RM specific code.
> JobManager
> > has a DelegationTokenManager which obtains tokens time-to-time (if
> > configured properly). JM knows which TaskManagers are in place so it can
> > distribute it to all TMs. That's it basically.
>
>
> I think we might both mean something different by the RM. JobManager is
> basically just a process encapsulating multiple components, one of which is
> a ResourceManager, which is the component that manages task manager
> registrations [1]. There is more or less a single implementation of the RM
> with plugable drivers for the active integrations (yarn, k8s).
>
> It would be great if you could share more details of how exactly the DTM is
> going to fit in the current JM architecture.
>
> 2. 99.9% of the code is generic but each RM handles tokens differently. A
> > good example is YARN obtains tokens on client side and then sets them on
> > the newly created AM container launch context. This is purely YARN
> specific
> > and cant't be spared. With my actual plans standalone can be changed to
> use
> > the framework. By using it I mean no RM specific DTM or whatsoever is
> > needed.
> >
>
> If we have tokens available on the client side, why do we need to set them
> into the AM (yarn specific concept) launch context? Why can't we simply
> send them to the JM, eg. as a parameter of the job submission / via
> separate RPC call? There might be something I'm missing due to limited
> knowledge, but handling the token on the "cluster framework" level doesn't
> seem necessary.
>
> [1]
>
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/#jobmanager
>
> Best,
> D.
>
> On Fri, Jan 21, 2022 at 7:48 PM Gabor Somogyi 
> wrote:
>
> > Oh and one more thing. I'm planning to add this feature in small chunk of
> > PRs because security is super hairy area. That way reviewers can be more
> > easily obtains the concept.
> >
> > On Fri, 21 Jan 2022, 18:03 David Morávek,  wrote:
> >
> > > Hi Gabor,
> > >
> > > thanks for drafting the FLIP, I think having a solid Kerberos support
> is
> > > crucial for many enterprise deployments.
> > >
> > > I have multiple questions regarding the implementation (note that I
> have
> > > very limited knowledge of Kerberos):
> > >
> > > 1) If I understand it correctly, we'll only obtain tokens in the job
> > > manager and then we'll distribute them via RPC (needs to be secured).
> > >
> > > Can you please outline how the communication will look like? Is the
> > > DelegationTokenManager going to be a part of the ResourceManager? Can
> you
> > > outline it's lifecycle / how it's going to be integrated there?
> > >
> > > 2) Do we really need a YARN / k8s specific implementations? Is it
> > possible
> > > to obtain / renew a token in a generic way? Maybe to rephrase that, is
> it
> > > possible to implement DelegationTokenManager for the standalone Flink?
> If
> > > we're able to solve this point, it could be possible to target all
> > > deployment scenarios with a single implementation.
> > >
> > > Best,
> > > D.
> > >
> > > On Fri, Jan 14, 2022 at 3:47 AM Junfan Zhang 
> > > wrote:
> > >
> > > > Hi G
> > > >
> > > > Thanks for your explain in detail. I have gotten your thoughts, and
> any
> > > > way this proposal
> > > > is a great improvement.
> > > >
> > > > Looking forward to your implementation and i will keep focus on it.
> > > > Thanks again.
> > > >
> > > > Best
> > > > JunFan.
> > > > On Jan 13, 2022, 9:20 PM +0800, Gabor Somogyi <
> > gabor.g.somo...@gmail.com
> > > >,
> > > > wrote:
> > > > > Just to confirm keeping "security.kerberos.fetch.delegation-token"
> is
> > > > added
> > > > > to the doc.
> > > > >
> > > > > BR,
> > > > > G
> > > > >
> > > > >
> > > > > On Thu, Jan 13, 2022 at 1:34 PM Gabor Somogyi <
> > > gabor.g.somo...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi JunFan,
> > > > > >
> > > > > > > By the way, maybe this should be added in the migration plan or
> > > > > > intergation section in the FLIP-211.
> > > > > >
> > > > > > Going to add this soon.
> > > > > >
> > > > > > > Besides, I have a question that the KDC will collapse when the
> > > > cluster
> > > > > > reached

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-21 Thread David Morávek

Hi Gabor,

1. One thing is important, token management is planned to be done
> generically within Flink and not scattered in RM specific code. JobManager
> has a DelegationTokenManager which obtains tokens time-to-time (if
> configured properly). JM knows which TaskManagers are in place so it can
> distribute it to all TMs. That's it basically.


I think we might both mean something different by the RM. JobManager is
basically just a process encapsulating multiple components, one of which is
a ResourceManager, which is the component that manages task manager
registrations [1]. There is more or less a single implementation of the RM
with plugable drivers for the active integrations (yarn, k8s).

It would be great if you could share more details of how exactly the DTM is
going to fit in the current JM architecture.

2. 99.9% of the code is generic but each RM handles tokens differently. A
> good example is YARN obtains tokens on client side and then sets them on
> the newly created AM container launch context. This is purely YARN specific
> and cant't be spared. With my actual plans standalone can be changed to use
> the framework. By using it I mean no RM specific DTM or whatsoever is
> needed.
>

If we have tokens available on the client side, why do we need to set them
into the AM (yarn specific concept) launch context? Why can't we simply
send them to the JM, eg. as a parameter of the job submission / via
separate RPC call? There might be something I'm missing due to limited
knowledge, but handling the token on the "cluster framework" level doesn't
seem necessary.

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/#jobmanager

Best,
D.

On Fri, Jan 21, 2022 at 7:48 PM Gabor Somogyi 
wrote:

> Oh and one more thing. I'm planning to add this feature in small chunk of
> PRs because security is super hairy area. That way reviewers can be more
> easily obtains the concept.
>
> On Fri, 21 Jan 2022, 18:03 David Morávek,  wrote:
>
> > Hi Gabor,
> >
> > thanks for drafting the FLIP, I think having a solid Kerberos support is
> > crucial for many enterprise deployments.
> >
> > I have multiple questions regarding the implementation (note that I have
> > very limited knowledge of Kerberos):
> >
> > 1) If I understand it correctly, we'll only obtain tokens in the job
> > manager and then we'll distribute them via RPC (needs to be secured).
> >
> > Can you please outline how the communication will look like? Is the
> > DelegationTokenManager going to be a part of the ResourceManager? Can you
> > outline it's lifecycle / how it's going to be integrated there?
> >
> > 2) Do we really need a YARN / k8s specific implementations? Is it
> possible
> > to obtain / renew a token in a generic way? Maybe to rephrase that, is it
> > possible to implement DelegationTokenManager for the standalone Flink? If
> > we're able to solve this point, it could be possible to target all
> > deployment scenarios with a single implementation.
> >
> > Best,
> > D.
> >
> > On Fri, Jan 14, 2022 at 3:47 AM Junfan Zhang 
> > wrote:
> >
> > > Hi G
> > >
> > > Thanks for your explain in detail. I have gotten your thoughts, and any
> > > way this proposal
> > > is a great improvement.
> > >
> > > Looking forward to your implementation and i will keep focus on it.
> > > Thanks again.
> > >
> > > Best
> > > JunFan.
> > > On Jan 13, 2022, 9:20 PM +0800, Gabor Somogyi <
> gabor.g.somo...@gmail.com
> > >,
> > > wrote:
> > > > Just to confirm keeping "security.kerberos.fetch.delegation-token" is
> > > added
> > > > to the doc.
> > > >
> > > > BR,
> > > > G
> > > >
> > > >
> > > > On Thu, Jan 13, 2022 at 1:34 PM Gabor Somogyi <
> > gabor.g.somo...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi JunFan,
> > > > >
> > > > > > By the way, maybe this should be added in the migration plan or
> > > > > intergation section in the FLIP-211.
> > > > >
> > > > > Going to add this soon.
> > > > >
> > > > > > Besides, I have a question that the KDC will collapse when the
> > > cluster
> > > > > reached 200 nodes you described
> > > > > in the google doc. Do you have any attachment or reference to prove
> > it?
> > > > >
> > > > > "KDC *may* collapse under some circumstances" is the proper
> wording.
> > > > >
> > > > > We have several customers who are executing workloads on
> Spark/Flink.
> > > Most
> > > > > of the time I'm facing their
> > > > > daily issues which is heavily environment and use-case dependent.
> > I've
> > > > > seen various cases:
> > > > > * where the mentioned ~1k nodes were working fine
> > > > > * where KDC thought the number of requests are coming from DDOS
> > attack
> > > so
> > > > > discontinued authentication
> > > > > * where KDC was simply not responding because of the load
> > > > > * where KDC was intermittently had some outage (this was the most
> > nasty
> > > > > thing)
> > > > >
> > > > > Since you're managing relatively big cluster then you know that KDC
> > is
> > > not
> > >

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

Oh and one more thing. I'm planning to add this feature in small chunk of
PRs because security is super hairy area. That way reviewers can be more
easily obtains the concept.

On Fri, 21 Jan 2022, 18:03 David Morávek,  wrote:

> Hi Gabor,
>
> thanks for drafting the FLIP, I think having a solid Kerberos support is
> crucial for many enterprise deployments.
>
> I have multiple questions regarding the implementation (note that I have
> very limited knowledge of Kerberos):
>
> 1) If I understand it correctly, we'll only obtain tokens in the job
> manager and then we'll distribute them via RPC (needs to be secured).
>
> Can you please outline how the communication will look like? Is the
> DelegationTokenManager going to be a part of the ResourceManager? Can you
> outline it's lifecycle / how it's going to be integrated there?
>
> 2) Do we really need a YARN / k8s specific implementations? Is it possible
> to obtain / renew a token in a generic way? Maybe to rephrase that, is it
> possible to implement DelegationTokenManager for the standalone Flink? If
> we're able to solve this point, it could be possible to target all
> deployment scenarios with a single implementation.
>
> Best,
> D.
>
> On Fri, Jan 14, 2022 at 3:47 AM Junfan Zhang 
> wrote:
>
> > Hi G
> >
> > Thanks for your explain in detail. I have gotten your thoughts, and any
> > way this proposal
> > is a great improvement.
> >
> > Looking forward to your implementation and i will keep focus on it.
> > Thanks again.
> >
> > Best
> > JunFan.
> > On Jan 13, 2022, 9:20 PM +0800, Gabor Somogyi  >,
> > wrote:
> > > Just to confirm keeping "security.kerberos.fetch.delegation-token" is
> > added
> > > to the doc.
> > >
> > > BR,
> > > G
> > >
> > >
> > > On Thu, Jan 13, 2022 at 1:34 PM Gabor Somogyi <
> gabor.g.somo...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi JunFan,
> > > >
> > > > > By the way, maybe this should be added in the migration plan or
> > > > intergation section in the FLIP-211.
> > > >
> > > > Going to add this soon.
> > > >
> > > > > Besides, I have a question that the KDC will collapse when the
> > cluster
> > > > reached 200 nodes you described
> > > > in the google doc. Do you have any attachment or reference to prove
> it?
> > > >
> > > > "KDC *may* collapse under some circumstances" is the proper wording.
> > > >
> > > > We have several customers who are executing workloads on Spark/Flink.
> > Most
> > > > of the time I'm facing their
> > > > daily issues which is heavily environment and use-case dependent.
> I've
> > > > seen various cases:
> > > > * where the mentioned ~1k nodes were working fine
> > > > * where KDC thought the number of requests are coming from DDOS
> attack
> > so
> > > > discontinued authentication
> > > > * where KDC was simply not responding because of the load
> > > > * where KDC was intermittently had some outage (this was the most
> nasty
> > > > thing)
> > > >
> > > > Since you're managing relatively big cluster then you know that KDC
> is
> > not
> > > > only used by Spark/Flink workloads
> > > > but the whole company IT infrastructure is bombing it so it really
> > depends
> > > > on other factors too whether KDC is reaching
> > > > it's limit or not. Not sure what kind of evidence are you looking for
> > but
> > > > I'm not authorized to share any information about
> > > > our clients data.
> > > >
> > > > One thing is for sure. The more external system types are used in
> > > > workloads (for ex. HDFS, HBase, Hive, Kafka) which
> > > > are authenticating through KDC the more possibility to reach this
> > > > threshold when the cluster is big enough.
> > > >
> > > > All in all this feature is here to help all users never reach this
> > > > limitation.
> > > >
> > > > BR,
> > > > G
> > > >
> > > >
> > > > On Thu, Jan 13, 2022 at 1:00 PM 张俊帆  wrote:
> > > >
> > > > > Hi G
> > > > >
> > > > > Thanks for your quick reply. I think reserving the config of
> > > > > *security.kerberos.fetch.delegation-token*
> > > > > and simplifying disable the token fetching is a good idea.By the
> way,
> > > > > maybe this should be added
> > > > > in the migration plan or intergation section in the FLIP-211.
> > > > >
> > > > > Besides, I have a question that the KDC will collapse when the
> > cluster
> > > > > reached 200 nodes you described
> > > > > in the google doc. Do you have any attachment or reference to prove
> > it?
> > > > > Because in our internal per-cluster,
> > > > > the nodes reaches > 1000 and KDC looks good. Do i missed or
> > misunderstood
> > > > > something? Please correct me.
> > > > >
> > > > > Best
> > > > > JunFan.
> > > > > On Jan 13, 2022, 5:26 PM +0800, dev@flink.apache.org, wrote:
> > > > > >
> > > > > >
> > > > >
> >
> https://docs.google.com/document/d/1JzMbQ1pCJsLVz8yHrCxroYMRP2GwGwvacLrGyaIx5Yc/edit?fbclid=IwAR0vfeJvAbEUSzHQAAJfnWTaX46L6o7LyXhMfBUCcPrNi-uXNgoOaI8PMDQ
> > > > >
> > > >
> >
>

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework