You should have permissions now. Note that I saw 2 accounts matching your name, and I picked gaborgsomogyi.

On 31/01/2022 11:28, Gabor Somogyi wrote:
Not sure if the mentioned write access has already been given or not, but I
still don't see any edit button.

G


On Fri, Jan 28, 2022 at 5:08 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

Hi Robert,

That would be awesome.

My cwiki username: gaborgsomogyi

G


On Fri, Jan 28, 2022 at 5:06 PM Robert Metzger <metrob...@gmail.com>
wrote:

Hey Gabor,

let me know your cwiki username, and I can give you write permissions.


On Fri, Jan 28, 2022 at 4:05 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

Thanks for making the design better! Nothing further to discuss from my
side.

I've started reflecting the agreement in the FLIP doc.
Since I don't have access to the wiki I need to ask Marci to do that, which
may take some time.

G


On Fri, Jan 28, 2022 at 3:52 PM David Morávek <d...@apache.org> wrote:

Hi,

AFAIU an under registration TM is not added to the registered TMs map until
RegistrationResponse ..

I think you're right; with a careful design around threading (delegating
update broadcasts to the main thread) + a synchronous initial update (which
would be nice to avoid) this should be doable.

Not sure what you mean "we can't register the TM without providing it with
token" but in an unsecured configuration registration must happen w/o tokens.

Exactly as you describe it, this was meant only for the "kerberized /
secured" cluster case; in other cases we wouldn't enforce a non-null token
in the response.

I think this is a good idea in general.

+1

If you don't have any more thoughts on the RPC / lifecycle part, can you
please reflect it into the FLIP?

D.

On Fri, Jan 28, 2022 at 3:16 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

- Make sure DTs issued by single DTMs are monotonically increasing (can be
sorted on TM side)

AFAIU an under registration TM is not added to the registered TMs map until
the RegistrationResponse is processed, which would contain the initial
tokens. If that's true then how is it possible to have a race with a DTM
update, which works on the registered TMs list? To be more specific,
"taskExecutors" is the registered map of TMs to which the DTM can send
updated tokens, but this doesn't contain the under-registration TM while the
RegistrationResponse is not yet processed, right?

Of course if the DTM can update while the RegistrationResponse is being
processed then some kind of sorting would be required, and in that case I
would agree.

- Scope DT updates by the RM ID and ensure that TM only accepts update from
the current leader

I've planned this initially the mentioned way, so agreed.

- Return initial token with the RegistrationResponse, which should make the
RPC contract bit clearer (ensure that we can't register the TM without
providing it with token)

I think this is a good idea in general. Not sure what you mean by "we can't
register the TM without providing it with token", but in an unsecured
configuration registration must happen w/o tokens.
All in all the newly added tokens field must be somehow optional.
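To illustrate what I mean by optional, a rough sketch (class and field shapes
are simplified for illustration, not the actual Flink registration classes):

    import java.util.Optional;
    import javax.annotation.Nullable;

    // Sketch only: the registration success message carries the serialized
    // tokens on a kerberized cluster and null otherwise.
    public class TaskExecutorRegistrationSuccessSketch {
        @Nullable
        private final byte[] delegationTokens; // null when security is disabled

        public TaskExecutorRegistrationSuccessSketch(@Nullable byte[] delegationTokens) {
            this.delegationTokens = delegationTokens;
        }

        public Optional<byte[]> getDelegationTokens() {
            return Optional.ofNullable(delegationTokens);
        }
    }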

G


On Fri, Jan 28, 2022 at 2:22 PM David Morávek <d...@apache.org> wrote:

We had a long discussion with Chesnay about the possible edge cases and it
basically boils down to the following two scenarios:

1) There is a possible race condition between TM registration (the first DT
update) and token refresh if they happen simultaneously. Then the
registration might beat the refreshed token. This could be easily addressed
if DTs could be sorted (e.g. by the expiration time) on the TM side. In
other words, if there are multiple updates at the same time we need to make
sure that we have a deterministic way of choosing the latest one.
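To make "a deterministic way of choosing the latest one" concrete, something
along these lines could work on the TM side (the issue-time field and the
class are only assumptions for the sketch, not existing code):

    // Illustrative only: every DTM update carries a monotonically increasing
    // issue time so the TM can pick the newest one deterministically.
    final class TokenUpdateSketch {
        final long issueTimeMillis;
        final byte[] serializedTokens;

        TokenUpdateSketch(long issueTimeMillis, byte[] serializedTokens) {
            this.issueTimeMillis = issueTimeMillis;
            this.serializedTokens = serializedTokens;
        }
    }

    // TM side: keep only the newest update, whatever the arrival order is.
    private TokenUpdateSketch latest;

    synchronized void onTokensReceived(TokenUpdateSketch update) {
        if (latest == null || update.issueTimeMillis > latest.issueTimeMillis) {
            latest = update;
            // apply the tokens to the UGI / credentials here
        }
    }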

One idea by Chesnay that popped up during this discussion was whether we
could simply return the initial token with the RegistrationResponse to avoid
making an extra call during the TM registration.

2) When the RM leadership changes (e.g. because a zookeeper session times
out) there might be a race condition where the old RM is shutting down and
updates the tokens, so that it might again beat the registration token of
the new RM. This could be avoided if we scope the token by
_ResourceManagerId_ and only accept updates for the current leader
(basically we'd have an extra parameter to the _updateDelegationToken_
method).
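A rough sketch of how that leader check could look on the TM side (the method
does not exist yet and the signature and helpers are only illustrative):

    // Illustrative TM-side handler for the proposed RPC.
    public CompletableFuture<Acknowledge> updateDelegationTokens(
            ResourceManagerId resourceManagerId, byte[] tokens) {
        // only accept updates coming from the RM leader this TM is connected to
        if (!isConnectedTo(resourceManagerId)) { // hypothetical helper
            return FutureUtils.completedExceptionally(
                    new FlinkException("Token update from a non-leader ResourceManager."));
        }
        applyTokens(tokens); // hypothetical helper, e.g. UGI addCredentials
        return CompletableFuture.completedFuture(Acknowledge.get());
    }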

-

DTM is way simpler than for example slot management, which could receive
updates from the JobMaster that the RM might not know about.

So if you want to go down the path you're describing it should be doable and
we'd propose the following to cover all cases:

- Make sure DTs issued by single DTMs are monotonically increasing (can be
sorted on TM side)
- Scope DT updates by the RM ID and ensure that the TM only accepts updates
from the current leader
- Return the initial token with the RegistrationResponse, which should make
the RPC contract a bit clearer (ensure that we can't register the TM without
providing it with a token)

Any thoughts?


On Fri, Jan 28, 2022 at 10:53 AM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

Thanks for investing your time!

The first 2 bullet points are clear.
If there is a chance that a TM can go into an inconsistent state then I
agree with the 3rd bullet point. Just before we agree on that I would like
to learn something new and understand how it is possible that a TM gets
corrupted. (In Spark I've never seen such a thing and there is no mechanism
to fix this, but Flink is definitely not Spark.)

Here is my understanding:
* DTM pushes newly obtained DTs to TMs and if any exception occurs then a
retry after "security.kerberos.tokens.retry-wait" happens. This means the
DTM keeps retrying as long as it's not possible to send the new DTs to all
registered TMs.
* New TM registration must fail if "updateDelegationToken" fails
* "updateDelegationToken" fails consistently like a DB (at least I plan to
implement it that way). If DTs arrive on the TM side then a single
"UserGroupInformation.getCurrentUser.addCredentials" call is made (see the
sketch below), which I've never seen fail.
* I hope all other code parts are not touching existing DTs within the JVM
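For reference, applying the arrived tokens on the TM side is roughly this much
Hadoop API (the byte[] wire format is an assumption of the sketch):

    import java.io.IOException;
    import org.apache.hadoop.io.DataInputBuffer;
    import org.apache.hadoop.security.Credentials;
    import org.apache.hadoop.security.UserGroupInformation;

    static void addTokensToCurrentUser(byte[] serializedTokens) throws IOException {
        Credentials credentials = new Credentials();
        DataInputBuffer in = new DataInputBuffer();
        in.reset(serializedTokens, serializedTokens.length);
        credentials.readTokenStorageStream(in);
        // merge the freshly received tokens into the process-wide UGI
        UserGroupInformation.getCurrentUser().addCredentials(credentials);
    }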
I would like to emphasize I'm not against adding it, I just want to see what
kind of problems we are facing. It would ease catching bugs earlier and help
in the maintenance.

All in all I would buy the idea to add the 3rd bullet if we foresee the
need.

G


On Fri, Jan 28, 2022 at 10:07 AM David Morávek <d...@apache.org> wrote:

Hi Gabor,

This is definitely headed in the right direction +1.

I think we still need to have a safeguard in case some of the TMs get into
an inconsistent state though, which will also eliminate the need for
implementing a custom retry mechanism (when the _updateDelegationToken_ call
fails for some reason).

We already have this safeguard in place for the slot pool (in case there are
some slots in an inconsistent state - e.g. we haven't freed them for some
reason) and for the partition tracker, which could simply be enhanced. This
is done via a periodic heartbeat from TaskManagers to the ResourceManager
that contains a report about the state of these two components (from the TM
perspective) so the RM can reconcile their state if necessary.

I don't think adding an additional field to _TaskExecutorHeartbeatPayload_
should be a concern as we only heartbeat every ~10s by default and the new
field would be small compared to the rest of the existing payload. Also the
heartbeat doesn't need to contain the whole DT, but just some identifier
which signals whether it uses the right one, which could be significantly
smaller.
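As a sketch, the reconciliation could look something like this on the RM side
(names are purely illustrative; TaskExecutorHeartbeatPayload carries no such
field today and the DTM accessors are hypothetical):

    // Illustrative only: the TM reports a small fingerprint of the tokens it
    // holds via the heartbeat payload, and the RM re-pushes the latest tokens
    // when the fingerprint is stale.
    void reconcileTokensOnHeartbeat(
            TaskExecutorGateway taskExecutorGateway,
            ResourceManagerId resourceManagerId,
            long reportedTokensFingerprint) {
        if (reportedTokensFingerprint != delegationTokenManager.getCurrentFingerprint()) {
            taskExecutorGateway.updateDelegationTokens(
                    resourceManagerId, delegationTokenManager.getCurrentTokens());
        }
    }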

This is still a PUSH based approach as the RM would again call the newly
introduced _updateDelegationToken_ when it encounters an inconsistency (e.g.
due to a temporary network partition / a race condition we didn't test for /
some other scenario we didn't think about). In practice these
inconsistencies are super hard to avoid and reason about (and unfortunately
yes, we see them happen from time to time), so reusing the existing
mechanism that is designed for this exact problem simplifies things.

To sum this up we'd have three code paths for calling
_updateDelegationToken_:
1) When the TM registers, we push the token (if the DTM already has it) to it
2) When the DTM obtains a new token it broadcasts it to all currently
connected TMs
3) When a TM gets out of sync, the DTM would reconcile its state

WDYT?

Best,
D.


On Wed, Jan 26, 2022 at 9:03 PM David Morávek <d...@apache.org> wrote:

Thanks for the update, I'll go over it tomorrow.

On Wed, Jan 26, 2022 at 5:33 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

Hi All,

Since it has turned out that DTM can't be added as a member of JobMaster
<https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176>
I've come up with a better proposal.
David, thanks for pointing this out, you've caught a bug in the early phase!

Namely, ResourceManager
<https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124>
is a single-instance class where the DTM can be added as a member variable.
It has a list of all already registered TMs and new TM registration is also
happening here.
To be more specific, the following can be added from a logic perspective:
* Create a new DTM instance in ResourceManager and start it (re-occurring
thread to obtain new tokens)
* Add a new function named "updateDelegationTokens" to TaskExecutorGateway
<https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutorGateway.java#L54>
* Call "updateDelegationTokens" on all registered TMs to propagate new DTs
* In case of new TM registration call "updateDelegationTokens" before
registration succeeds to set up the new TM properly

This way:
* only a single DTM would live within a cluster, which is the expected
behavior
* the DTM is going to be added to a central place where all deployment
targets can make use of it
* DTs are going to be pushed to TMs, which would generate less network
traffic than a pull based approach (please see my previous mail where I've
described both approaches)
* the HA scenario is going to be consistent because such
<https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java#L1069>
a solution can be added to "updateDelegationTokens"
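A rough sketch of the RPC addition and the RM-side broadcast (signatures are
only illustrative, none of this exists yet):

    // New method on TaskExecutorGateway (illustrative signature):
    CompletableFuture<Acknowledge> updateDelegationTokens(
            ResourceManagerId resourceManagerId, byte[] delegationTokens);

    // ResourceManager side (sketch): broadcast freshly obtained tokens to every
    // registered TM; the same call would be made for a newly registering TM
    // before its registration completes.
    private void broadcastDelegationTokens(byte[] delegationTokens) {
        for (WorkerRegistration<WorkerType> workerRegistration : taskExecutors.values()) {
            workerRegistration
                    .getTaskExecutorGateway()
                    .updateDelegationTokens(getFencingToken(), delegationTokens);
        }
    }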

@David or all others please share whether you agree on this or you have a
better idea/suggestion.

BR,
G


On Tue, Jan 25, 2022 at 11:00 AM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

First of all thanks for investing your time and helping me out. As I see you
have pretty solid knowledge in the RPC area.
I would like to rely on your knowledge since I'm learning this part.

- Do we need to introduce a new RPC method or can we for example piggyback
on heartbeats?

I'm fine with either solution but one thing is important conceptually. There
are fundamentally 2 ways how tokens can be updated:
- Push way: When there are new DTs then the JM JVM pushes the DTs to the TM
JVMs. This is the preferred one since only a tiny amount of control logic is
needed.
- Pull way: Each TM would poll the JM from time to time whether there are
new tokens and each TM decides alone whether the DTs need to be updated or
not. As you've mentioned, here some ID needs to be generated and it would
generate quite some additional network traffic which can definitely be
avoided. As a final thought, in Spark we've had this way of DT propagation
logic and we've had major issues with it.

So all in all the DTM needs to obtain new tokens and there must be a way to
send this data to all TMs from the JM.

- What delivery semantics are we looking for? (what if we're only able to
update subset of TMs / what happens if we exhaust retries / should we even
have the retry mechanism whatsoever) - I have a feeling that somehow
leveraging the existing heartbeat mechanism could help to answer these
questions

Let's go through these questions one by one.

What delivery semantics are we looking for?
The DTM must receive an exception when at least one TM was not able to get
the DTs.

What if we're only able to update a subset of TMs?
In such a case the DTM will reschedule the token obtainment after
"security.kerberos.tokens.retry-wait" time.

What happens if we exhaust retries?
There is no number of retries. In the default configuration tokens need to
be re-obtained after one day. The DTM tries to obtain new tokens after
1 day * 0.75 (security.kerberos.tokens.renewal-ratio) = 18 hours. When that
fails it retries after "security.kerberos.tokens.retry-wait", which is 1 hour
by default. If it never succeeds then an authentication error is going to
happen on the TM side and the workload is going to stop.
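Expressed as configuration with the defaults mentioned above (key names as
proposed in the FLIP; the one-day token lifetime itself comes from the KDC /
the secured service, not from Flink):

    # flink-conf.yaml (proposed keys)
    # re-obtain at 24 h * 0.75 = 18 h into the token lifetime
    security.kerberos.tokens.renewal-ratio: 0.75
    # wait one hour between failed obtain attempts
    security.kerberos.tokens.retry-wait: 1 h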

Should we even have the retry mechanism whatsoever?
Yes, because there are always temporary cluster issues.

What does it mean for the running application (how does this look like from
the user perspective)? As far as I remember the logs are only collected
("aggregated") after the container is stopped, is that correct?

With the default config it works like that, but it can be forced to
aggregate at specific intervals. A useful feature is forcing YARN to
aggregate logs while the job is still running. For long-running jobs such as
streaming jobs, this is invaluable. To do this,
yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds must be
set to a non-negative value. When this is set, a timer will be set for the
given duration, and whenever that timer goes off, log aggregation will run
on new files.
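For completeness, that is a NodeManager side setting in yarn-site.xml, e.g.
(the interval value here is just an example):

    <!-- yarn-site.xml on the NodeManagers: also upload logs of still-running
         containers, here once per hour (value in seconds) -->
    <property>
      <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
      <value>3600</value>
    </property>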

I think this topic should get its own section in the FLIP (having some cross
reference to the YARN ticket would be really useful, but I'm not sure if
there are any).

I think this is important knowledge but this FLIP is not touching the
already existing behavior. DTs are set on the AM container and are renewed
by YARN until it's not possible anymore. Any kind of new code is not going
to change this limitation. BTW, there is no jira for this. If you think it's
worth writing this down then I think the good place is the official security
doc area, as a caveat.

If we split the FLIP into two parts / sections as I've suggested, I don't
really think that you need to explicitly test for each deployment scenario /
cluster framework, because the DTM part is completely independent of the
deployment target. Basically this is what I'm aiming for with "making it
work with the standalone" (as simple as starting a new java process) Flink
first (which is also how most people deploy streaming applications on k8s
and the direction we're pushing forward with the auto-scaling / reactive
mode initiatives).

I see your point and agree with the main direction. k8s is the megatrend
which most of the people will use sooner or later. Not 100% sure what kind
of split you suggest, but in my view the main target is to add this feature
and I'm open to any logical work ordering.
Please share the specific details and we'll work it out...

G


On Mon, Jan 24, 2022 at 3:04 PM David Morávek <d...@apache.org> wrote:

Could you point to a code where you think it could be added exactly? A
helping hand is welcome here 🙂

I think you can take a look at _ResourceManagerPartitionTracker_ [1] which
seems to have somewhat similar properties to the DTM.

One topic that needs to be addressed there is how the RPC with the
_TaskExecutorGateway_ should look like.
- Do we need to introduce a new RPC method or can we for example piggyback
on heartbeats?
- What delivery semantics are we looking for? (what if we're only able to
update subset of TMs / what happens if we exhaust retries / should we even
have the retry mechanism whatsoever) - I have a feeling that somehow
leveraging the existing heartbeat mechanism could help to answer these
questions

In short, after DT reaches its max lifetime then log aggregation stops

What does it mean for the running application (how does this look like from
the user perspective)? As far as I remember the logs are only collected
("aggregated") after the container is stopped, is that correct? I think this
topic should get its own section in the FLIP (having some cross reference to
the YARN ticket would be really useful, but I'm not sure if there are any).

All deployment modes (per-job, per-app, ...) are planned to be tested and
expected to work with the initial implementation, however not all deployment
targets (k8s, local, ...)

If we split the FLIP into two parts / sections that I've suggested, I don't
really think that you need to explicitly test for each deployment scenario /
cluster framework, because the DTM part is completely independent of the
deployment target. Basically this is what I'm aiming for with "making it
work with the standalone" (as simple as starting a new java process) Flink
first (which is also how most people deploy streaming applications on k8s
and the direction we're pushing forward with the auto-scaling / reactive
mode initiatives).

The whole integration with YARN (let's forget about log aggregation for a
moment) / k8s-native only boils down to how do we make the keytab file local
to the JobManager so the DTM can read it, so it's basically built on top of
that. The only special thing that needs to be tested there is the "keytab
distribution" code path.

[1]
https://github.com/apache/flink/blob/release-1.14.3/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResourceManagerPartitionTracker.java

Best,
D.

On Mon, Jan 24, 2022 at 12:35 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

There is a separate JobMaster for each job within a Flink cluster and each
JobMaster only has a partial view of the task managers

Good point! I've had a deeper look and you're right. We definitely need to
find another place.

Related per-cluster or per-job keytab:
In the current code a per-cluster keytab is implemented and I intend to keep
it like this within this FLIP. The reason is simple: tokens on the TM side
can be stored within the UserGroupInformation (UGI) structure which is
global. I'm not telling it's impossible to change that, but I think that
this is such a complexity which the initial implementation is not required
to contain. Additionally we've not seen such a need from the user side. If
the need arises later on then another FLIP with this topic can be created
and discussed. Proper multi-UGI handling within a single JVM is a topic
where several rounds of deep-dive with the Hadoop/YARN guys are required.
single DTM instance embedded with the ResourceManager (the Flink component)

Could you point to a code where you think it could be added exactly? A
helping hand is welcome here 🙂

Then the single (initial) implementation should work with all the deployment
modes out of the box (which is not what the FLIP suggests). Is that correct?

All deployment modes (per-job, per-app, ...) are planned to be tested and
expected to work with the initial implementation, however not all deployment
targets (k8s, local, ...) are intended to be tested. Per deployment target a
new jira needs to be created, where I expect a small amount of code to be
added but a relatively expensive testing effort to be required.
I've taken a look into the prototype and in the "YarnClusterDescriptor"
you're injecting a delegation token into the AM [1] (that's obtained using
the provided keytab). If I understand this correctly from previous
discussion / FLIP, this is to support log aggregation and DT has a limited
validity. How is this DT going to be renewed?

You're clever and touched a limitation which Spark has too. In short, after
a DT reaches its max lifetime, log aggregation stops. I've had several
deep-dive rounds with the YARN guys during my Spark years because I wanted
to fill this gap. They can't provide us any way to re-inject the newly
obtained DT, so in the end I gave up on this.

BR,
G


On Mon, 24 Jan 2022, 11:00 David Morávek <d...@apache.org> wrote:

Hi Gabor,

There is actually a huge difference between the JobManager (process) and the
JobMaster (job coordinator). The naming is unfortunately a bit misleading
here for historical reasons. There is a separate JobMaster for each job
within a Flink cluster and each JobMaster only has a partial view of the
task managers (depending on where the slots for a particular job are
allocated). This means that you'll end up with N "DelegationTokenManagers"
competing with each other (N = number of running jobs in the cluster).
This makes me think we're mixing two abstraction levels here:

a) Per-cluster delegation tokens
- Simpler approach, it would involve a single DTM instance embedded with the
ResourceManager (the Flink component)

b) Per-job delegation tokens
- More complex approach, but could be more flexible from the user side of
things.
- Multiple DTM instances that are bound to the JobMaster lifecycle.
Delegation tokens are attached to the particular slots that are executing
the job tasks instead of the whole task manager (a TM could be executing
multiple jobs with different tokens).
- The question is which keytab should be used for the clustering framework,
to support log aggregation on YARN (an extra keytab, or the keytab that
comes with the first job?)

I think these are the things that need to be clarified in the FLIP before
proceeding.

A follow-up question for getting a better understanding of where this should
be headed: Are there any use cases where users may want to use different
keytabs with each job, or are we fine with using a cluster-wide keytab? If
we go with per-cluster keytabs, is it OK that all jobs submitted into this
cluster can access it (even the future ones)? Should this be a security
concern?

Presume you thought I would implement a new class with the JobManager name.
The plan is not that.

I've never suggested such a thing.


No. As said earlier, DT handling is planned to be done completely in Flink.
The DTM has a renewal thread which re-obtains tokens at the proper time when
needed.

Then the single (initial) implementation should work with all the deployment
modes out of the box (which is not what the FLIP suggests). Is that correct?

If the cluster framework also requires a delegation token for its inner
working (this IMO only applies to YARN), it might need an extra step
(injecting the token into the application master container).
Separating the individual layers (actual Flink cluster - basically making
this work with a standalone deployment / "cluster framework" - support for
YARN log aggregation) in the FLIP would be useful.

Reading the linked Spark readme could be useful.

I've read that, but please be patient with the questions, Kerberos is not an
easy topic to get into and I've had very little contact with it in the past.



https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
I've taken a look into the prototype and in the "YarnClusterDescriptor"
you're injecting a delegation token into the AM [1] (that's obtained using
the provided keytab). If I understand this correctly from previous
discussion / FLIP, this is to support log aggregation and DT has a limited
validity. How is this DT going to be renewed?

[1]
https://github.com/gaborgsomogyi/flink/commit/8ab75e46013f159778ccfce52463e7bc63e395a9#diff-02416e2d6ca99e1456f9c3949f3d7c2ac523d3fe25378620c09632e4aac34e4eR1261

Best,
D.

On Fri, Jan 21, 2022 at 9:35 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

Here is the exact class (I'm on mobile so haven't had a look at the exact
class name):

https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176

That keeps track of the TMs where the tokens can be sent to.
My feeling would be that we shouldn't really introduce a new component with
a custom lifecycle, but rather we should try to incorporate this into
existing ones.

Can you be more specific? Presume you thought I would implement a new class
with the JobManager name. The plan is not that.

If I understand this correctly, this means that we then push the token
renewal logic to YARN.

No. As said earlier, DT handling is planned to be done completely in Flink.
The DTM has a renewal thread which re-obtains tokens at the proper time when
needed. YARN log aggregation is a totally different feature, where YARN does
the renewal. Log aggregation was an example why the code can't be 100%
reusable for all resource managers. Reading the linked Spark readme could be
useful.
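As a rough sketch, the renewal thread boils down to something like this
(names are illustrative, not the actual DTM code):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Illustrative renewal loop: re-obtain tokens at renewal-ratio of their
    // lifetime, fall back to retry-wait on failure, then push to all TMs.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    private void scheduleObtain(long delayMillis) {
        scheduler.schedule(this::obtainAndBroadcast, delayMillis, TimeUnit.MILLISECONDS);
    }

    private void obtainAndBroadcast() {
        try {
            ObtainedTokens tokens = obtainDelegationTokens();      // hypothetical helper
            broadcastToTaskManagers(tokens);                       // hypothetical helper
            long lifetime = tokens.earliestMaxDateMillis() - System.currentTimeMillis();
            scheduleObtain((long) (lifetime * renewalRatio));      // e.g. 24 h * 0.75 = 18 h
        } catch (Exception e) {
            scheduleObtain(retryWaitMillis);                       // "security.kerberos.tokens.retry-wait"
        }
    }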

G

On Fri, 21 Jan 2022, 21:05 David Morávek <d...@apache.org> wrote:

JobManager is the Flink class.

There is no such class in Flink. The closest thing to the JobManager is the
ClusterEntrypoint. The cluster entrypoint spawns a new RM Runner &
Dispatcher Runner that start participating in the leader election. Once they
gain leadership they spawn the actual underlying instances of these two
"main components".

My feeling would be that we shouldn't really introduce a new component with
a custom lifecycle, but rather we should try to incorporate this into
existing ones.

My biggest concerns would be:

- How would the lifecycle of the new component look like with regards to HA
setups? If we really decide to introduce a completely new component, how
should this work in case of multiple JobManager instances?
- Which components does it talk to / how? For example how does the broadcast
of a new token to task managers (TaskManagerGateway) look like? Do we simply
introduce a new RPC on the ResourceManagerGateway that broadcasts it, or
does the new component need to do some kind of bookkeeping of task managers
that it needs to notify?

YARN based HDFS log aggregation would not work by dropping that code. Just
to be crystal clear, the actual implementation contains this for exactly
this reason.

This is the missing part +1. If I understand this correctly, this means that
we then push the token renewal logic to YARN. How do you plan to implement
the renewal logic on k8s?

D.

On Fri, Jan 21, 2022 at 8:37 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

I think we might both mean something different by the RM.

Your feeling is right; I've not specified these terms well in the
explanation.
By RM I meant the resource management framework. JobManager is the Flink
class. This means that inside the JM instance there will be a DTM instance,
so they would have the same lifecycle. Hope I've answered the question.
If we have tokens available on the client side, why do we need to set them
into the AM (yarn specific concept) launch context?

YARN based HDFS log aggregation would not work by dropping that code. Just
to be crystal clear, the actual implementation contains this for exactly
this reason.
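For reference, the YARN specific part is roughly the following standard
Hadoop/YARN API usage (a simplified sketch, not the exact prototype code):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import org.apache.hadoop.io.DataOutputBuffer;
    import org.apache.hadoop.security.Credentials;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

    static void setTokensOnAmContainer(ContainerLaunchContext amContainer, Credentials credentials)
            throws IOException {
        // serialize the client-side obtained tokens and hand them to the AM
        // container so YARN can use them, e.g. for log aggregation
        DataOutputBuffer dob = new DataOutputBuffer();
        credentials.writeTokenStorageToStream(dob);
        amContainer.setTokens(ByteBuffer.wrap(dob.getData(), 0, dob.getLength()));
    }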

G

On Fri, 21 Jan 2022, 20:12 David Morávek <d...@apache.org> wrote:

Hi Gabor,

1. One thing is important, token management is planned to be done
generically within Flink and not scattered in RM specific code. The
JobManager has a DelegationTokenManager which obtains tokens from time to
time (if configured properly). The JM knows which TaskManagers are in place
so it can distribute the tokens to all TMs. That's it basically.

I think we might both mean something different by the RM. The JobManager is
basically just a process encapsulating multiple components, one of which is
the ResourceManager, which is the component that manages task manager
registrations [1]. There is more or less a single implementation of the RM
with plugable drivers for the active integrations (yarn, k8s).
It would be great if you could share more details of how exactly the DTM is
going to fit in the current JM architecture.

2. 99.9% of the code is generic but each RM handles tokens differently. A
good example is YARN, which obtains tokens on the client side and then sets
them on the newly created AM container launch context. This is purely YARN
specific and can't be spared. With my actual plans standalone can be changed
to use the framework. By using it I mean no RM specific DTM or whatsoever is
needed.

If we have tokens available on the client side, why do we need to set them
into the AM (yarn specific concept) launch context? Why can't we simply send
them to the JM, e.g. as a parameter of the job submission / via a separate
RPC call? There might be something I'm missing due to limited knowledge, but
handling the token on the "cluster framework" level doesn't seem necessary.

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/#jobmanager

Best,
D.

On Fri, Jan 21, 2022 at 7:48 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

Oh and one more thing. I'm planning to add this feature in small chunks of
PRs because security is a super hairy area. That way reviewers can absorb
the concept more easily.

On Fri, 21 Jan 2022, 18:03 David Morávek <d...@apache.org> wrote:

Hi Gabor,

thanks for drafting the FLIP, I think having solid Kerberos support is
crucial for many enterprise deployments.

I have multiple questions regarding the implementation (note that I have
very limited knowledge of Kerberos):

1) If I understand it correctly, we'll only obtain tokens in the job manager
and then we'll distribute them via RPC (which needs to be secured). Can you
please outline how the communication will look like? Is the
DelegationTokenManager going to be a part of the ResourceManager? Can you
outline its lifecycle / how it's going to be integrated there?
2) Do we really need YARN / k8s specific implementations? Is it possible to
obtain / renew a token in a generic way? Maybe to rephrase that, is it
possible to implement a DelegationTokenManager for standalone Flink? If
we're able to solve this point, it could be possible to target all
deployment scenarios with a single implementation.
Best,
D.

On Fri, Jan 14, 2022 at 3:47 AM Junfan Zhang <zuston.sha...@gmail.com>
wrote:

Hi G

Thanks for explaining in detail. I have gotten your thoughts, and in any
case this proposal is a great improvement.

Looking forward to your implementation and I will keep my focus on it.
Thanks again.

Best
JunFan.
On Jan 13, 2022, 9:20 PM +0800, Gabor Somogyi <gabor.g.somo...@gmail.com>,
wrote:

Just to confirm: keeping "security.kerberos.fetch.delegation-token" has been
added to the doc.

BR,
G


On Thu, Jan 13, 2022 at 1:34 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

Hi JunFan,

By the way, maybe this should be added in the migration plan or integration
section in the FLIP-211.

Going to add this soon.

Besides, I have a question: you described in the google doc that the KDC
will collapse when the cluster reaches 200 nodes. Do you have any attachment
or reference to prove it?

"KDC *may* collapse under some circumstances" is the proper wording.
We have several customers who are executing workloads on Spark/Flink. Most
of the time I'm facing their daily issues, which are heavily environment and
use-case dependent. I've seen various cases:
* where the mentioned ~1k nodes were working fine
* where KDC thought the number of requests was coming from a DDOS attack so
it discontinued authentication
* where KDC was simply not responding because of the load
* where KDC intermittently had some outage (this was the most nasty thing)

Since you're managing a relatively big cluster you know that KDC is not only
used by Spark/Flink workloads but the whole company IT infrastructure is
bombing it, so it really depends on other factors too whether KDC is
reaching its limit or not. Not sure what kind of evidence you are looking
for, but I'm not authorized to share any information about our clients'
data.

One thing is for sure: the more external system types are used in workloads
(for ex. HDFS, HBase, Hive, Kafka) which are authenticating through KDC, the
more possible it is to reach this threshold when the cluster is big enough.
All in all this feature is here to help all users never reach this
limitation.

BR,
G


On Thu, Jan 13, 2022 at 1:00 PM 张俊帆 <zuston.sha...@gmail.com> wrote:

Hi G

Thanks for your quick reply. I think reserving the config of
*security.kerberos.fetch.delegation-token* and simply disabling the token
fetching is a good idea. By the way, maybe this should be added in the
migration plan or integration section in the FLIP-211.
Besides, I have a question: you described in the google doc that the KDC
will collapse when the cluster reaches 200 nodes. Do you have any attachment
or reference to prove it? Because in our internal per-cluster setup the
number of nodes reaches > 1000 and KDC looks good. Did I miss or
misunderstand something? Please correct me.

Best
JunFan.
On Jan 13, 2022, 5:26 PM +0800, dev@flink.apache.org, wrote:

https://docs.google.com/document/d/1JzMbQ1pCJsLVz8yHrCxroYMRP2GwGwvacLrGyaIx5Yc/edit?fbclid=IwAR0vfeJvAbEUSzHQAAJfnWTaX46L6o7LyXhMfBUCcPrNi-uXNgoOaI8PMDQ

