Hey Gabor,

Let me know your cwiki username and I can give you write permissions.


On Fri, Jan 28, 2022 at 4:05 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

> Thanks for making the design better! Nothing further to discuss from my
> side.
>
> I've started to reflect the agreement in the FLIP doc.
> Since I don't have access to the wiki I need to ask Marci to do that, which
> may take some time.
>
> G
>
>
> On Fri, Jan 28, 2022 at 3:52 PM David Morávek <d...@apache.org> wrote:
>
> > Hi,
> >
> > > AFAIU a TM under registration is not added to the registered TMs map
> > > until RegistrationResponse ..
> > >
> >
> > I think you're right; with a careful design around threading (delegating
> > update broadcasts to the main thread) plus a synchronous initial update
> > (which would be nice to avoid), this should be doable.
> >
> > > Not sure what you mean "we can't register the TM without providing it
> > > with token" but in unsecure configuration registration must happen w/o
> > > tokens.
> > >
> >
> > Exactly as you describe it. This was meant only for the "kerberized /
> > secured" cluster case; in other cases we wouldn't enforce a non-null token
> > in the response.
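> >
> > To illustrate (a rough sketch only -- the class and field are assumptions of
> > this sketch, not the final API): the registration success message could carry
> > a nullable token blob, so an unsecured setup simply passes null while a
> > kerberized cluster must provide it.
> >
> >     // Sketch: response sent back to the TM on successful registration.
> >     // The fields of the real response are elided; only the addition is shown.
> >     public class TaskExecutorRegistrationSuccess extends RegistrationResponse.Success {
> >
> >         // Serialized Hadoop Credentials; stays null when Kerberos is not enabled.
> >         private final byte[] initialDelegationTokens;
> >
> >         public TaskExecutorRegistrationSuccess(byte[] initialDelegationTokens) {
> >             this.initialDelegationTokens = initialDelegationTokens;
> >         }
> >
> >         public byte[] getInitialDelegationTokens() {
> >             return initialDelegationTokens;
> >         }
> >     }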
> >
> > > I think this is a good idea in general.
> > >
> >
> > +1
> >
> > If you don't have any more thoughts on the RPC / lifecycle part, can you
> > please reflect it in the FLIP?
> >
> > D.
> >
> > On Fri, Jan 28, 2022 at 3:16 PM Gabor Somogyi <gabor.g.somo...@gmail.com
> >
> > wrote:
> >
> > > > - Make sure DTs issued by a single DTM are monotonically increasing
> > > > (can be sorted on the TM side)
> > >
> > > AFAIU a TM under registration is not added to the registered TMs map until
> > > the RegistrationResponse, which would contain the initial tokens, is
> > > processed. If that's true then how is it possible to have a race with a
> > > DTM update, which works on the registered TMs list?
> > > To be more specific: "taskExecutors" is the map of registered TMs to which
> > > the DTM can send updated tokens, but it doesn't contain the TM under
> > > registration until the RegistrationResponse is processed, right?
> > >
> > > Of course, if the DTM can update while the RegistrationResponse is being
> > > processed, then some kind of sorting would be required, and in that case
> > > I would agree.
> > >
> > > - Scope DT updates by the RM ID and ensure that TM only accepts update
> > from
> > > the current leader
> > >
> > > I initially planned this the mentioned way, so agreed.
> > >
> > > - Return initial token with the RegistrationResponse, which should make
> > the
> > > RPC contract bit clearer (ensure that we can't register the TM without
> > > providing it with token)
> > >
> > > I think this is a good idea in general. Not sure what you mean by "we
> > > can't register the TM without providing it with token", but in an
> > > unsecure configuration registration must happen w/o tokens.
> > > All in all, the newly added tokens field must be optional somehow.
> > >
> > > G
> > >
> > >
> > > On Fri, Jan 28, 2022 at 2:22 PM David Morávek <d...@apache.org> wrote:
> > >
> > > > We had a long discussion with Chesnay about the possible edge cases and
> > > > it basically boils down to the following two scenarios:
> > > >
> > > > 1) There is a possible race condition between the TM registration (the
> > > > first DT update) and a token refresh if they happen simultaneously. Then
> > > > the registration might beat the refreshed token. This could be easily
> > > > addressed if DTs could be sorted (e.g. by the expiration time) on the TM
> > > > side. In other words, if there are multiple updates at the same time we
> > > > need to make sure that we have a deterministic way of choosing the
> > > > latest one.
> > > >
> > > > One idea by Chesnay that popped up during this discussion was whether
> > we
> > > > could simply return the initial token with the RegistrationResponse
> to
> > > > avoid making an extra call during the TM registration.
> > > >
> > > > 2) When the RM leadership changes (e.g. because the zookeeper session
> > > > times out) there might be a race condition where the old RM, while
> > > > shutting down, updates the tokens and again beats the token pushed
> > > > during registration by the new RM. This could be avoided if we scope
> > > > the token by _ResourceManagerId_ and only accept updates from the
> > > > current leader (basically we'd have an extra parameter to the
> > > > _updateDelegationToken_ method).
> > > >
> > > > -
> > > >
> > > > The DTM is way simpler than, for example, slot management, which could
> > > > receive updates from the JobMaster that the RM might not know about.
> > > >
> > > > So if you want to go down the path you're describing it should be doable
> > > > and we'd propose the following to cover all cases:
> > > >
> > > > - Make sure DTs issued by a single DTM are monotonically increasing
> > > > (can be sorted on the TM side)
> > > > - Scope DT updates by the RM ID and ensure that the TM only accepts
> > > > updates from the current leader (a rough sketch follows below)
> > > > - Return the initial token with the RegistrationResponse, which should
> > > > make the RPC contract a bit clearer (ensure that we can't register the
> > > > TM without providing it with a token)
> > > >
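> > > > A minimal sketch of the last two bullets (method, class and exception names
> > > > are assumptions, not the final API): the update RPC is fenced by the RM ID,
> > > > so the TM can drop updates coming from a ResourceManager that is no longer
> > > > the leader.
> > > >
> > > >     // TaskExecutorGateway (sketch):
> > > >     CompletableFuture<Acknowledge> updateDelegationToken(
> > > >             ResourceManagerId resourceManagerId, byte[] serializedTokens);
> > > >
> > > >     // TaskExecutor side (sketch):
> > > >     public CompletableFuture<Acknowledge> updateDelegationToken(
> > > >             ResourceManagerId resourceManagerId, byte[] serializedTokens) {
> > > >         if (!isConnectedToResourceManager(resourceManagerId)) {
> > > >             // update from an RM that is not the current leader -> reject it
> > > >             return FutureUtils.completedExceptionally(
> > > >                     new TaskManagerException("Unknown ResourceManager leader."));
> > > >         }
> > > >         // apply the tokens to the process-wide UGI (see the UGI sketch further down)
> > > >         addDelegationTokensToCurrentUser(serializedTokens);
> > > >         return CompletableFuture.completedFuture(Acknowledge.get());
> > > >     }
> > > >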
> > > > Any thoughts?
> > > >
> > > >
> > > > On Fri, Jan 28, 2022 at 10:53 AM Gabor Somogyi <
> > > gabor.g.somo...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks for investing your time!
> > > > >
> > > > > The first 2 bullet points are clear.
> > > > > If there is a chance that a TM can get into an inconsistent state then
> > > > > I agree with the 3rd bullet point.
> > > > > Just before we agree on that I would like to learn something new and
> > > > > understand how it is possible that a TM gets corrupted. (In Spark I've
> > > > > never seen such a thing and there is no mechanism to fix it, but Flink
> > > > > is definitely not Spark.)
> > > > >
> > > > > Here is my understanding:
> > > > > * DTM pushes newly obtained DTs to the TMs and if any exception occurs
> > > > > then a retry happens after "security.kerberos.tokens.retry-wait". This
> > > > > means the DTM keeps retrying until the new DTs can be sent to all
> > > > > registered TMs.
> > > > > * New TM registration must fail if "updateDelegationToken" fails
> > > > > * "updateDelegationToken" fails consistently like a DB (at least I plan
> > > > > to implement it that way). If DTs arrive on the TM side then a single
> > > > > "UserGroupInformation.getCurrentUser.addCredentials" call is made,
> > > > > which I've never seen fail (see the sketch after this list).
> > > > > * I hope all other code parts are not touching existing DTs within the
> > > > > JVM
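> > > > >
> > > > > Roughly what that TM-side call boils down to, using plain Hadoop APIs (the
> > > > > helper name and the assumption that the tokens arrive in the standard
> > > > > serialized Credentials format are mine):
> > > > >
> > > > >     import java.io.ByteArrayInputStream;
> > > > >     import java.io.DataInputStream;
> > > > >     import org.apache.hadoop.security.Credentials;
> > > > >     import org.apache.hadoop.security.UserGroupInformation;
> > > > >
> > > > >     static void addDelegationTokens(byte[] serializedTokens) throws Exception {
> > > > >         Credentials credentials = new Credentials();
> > > > >         try (DataInputStream in =
> > > > >                 new DataInputStream(new ByteArrayInputStream(serializedTokens))) {
> > > > >             // expects the format written by Credentials#writeTokenStorageToStream
> > > > >             credentials.readTokenStorageStream(in);
> > > > >         }
> > > > >         // adds/overwrites tokens in the process-wide UGI; existing ones are kept
> > > > >         UserGroupInformation.getCurrentUser().addCredentials(credentials);
> > > > >     }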
> > > > >
> > > > > I would like to emphasize I'm not against adding it, I just want to see
> > > > > what kind of problems we are facing.
> > > > > It would make it easier to catch bugs early and help with maintenance.
> > > > >
> > > > > All in all, I would buy the idea of adding the 3rd bullet if we foresee
> > > > > the need.
> > > > >
> > > > > G
> > > > >
> > > > >
> > > > > On Fri, Jan 28, 2022 at 10:07 AM David Morávek <d...@apache.org>
> > > wrote:
> > > > >
> > > > > > Hi Gabor,
> > > > > >
> > > > > > This is definitely headed in the right direction, +1.
> > > > > >
> > > > > > I think we still need to have a safeguard in case some of the TMs
> > > > > > get into an inconsistent state though, which will also eliminate the
> > > > > > need for implementing a custom retry mechanism (when the
> > > > > > _updateDelegationToken_ call fails for some reason).
> > > > > >
> > > > > > We already have this safeguard in place for the slot pool (in case
> > > > > > there are some slots in an inconsistent state - e.g. we haven't freed
> > > > > > them for some reason) and for the partition tracker, which could
> > > > > > simply be enhanced. This is done via a periodic heartbeat from the
> > > > > > TaskManagers to the ResourceManager that contains a report about the
> > > > > > state of these two components (from the TM perspective) so the RM can
> > > > > > reconcile their state if necessary.
> > > > > >
> > > > > > I don't think adding an additional field to
> > > > > > _TaskExecutorHeartbeatPayload_ should be a concern, as we only
> > > > > > heartbeat every ~10s by default and the new field would be small
> > > > > > compared to the rest of the existing payload. Also, the heartbeat
> > > > > > doesn't need to contain the whole DT, just some identifier that
> > > > > > signals whether the TM uses the right one, which could be
> > > > > > significantly smaller.
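> > > > > >
> > > > > > As an illustration (all names are assumptions of this sketch, not existing
> > > > > > Flink API): the TM could report a small fingerprint of the tokens it
> > > > > > currently holds and, on each heartbeat, the RM re-pushes the tokens when
> > > > > > the fingerprint doesn't match what the DTM last distributed.
> > > > > >
> > > > > >     // RM side, when processing a TM heartbeat (sketch):
> > > > > >     void reconcileDelegationTokens(ResourceID taskManagerId, long reportedFingerprint) {
> > > > > >         if (reportedFingerprint != latestTokenFingerprint) {
> > > > > >             WorkerRegistration<?> worker = taskExecutors.get(taskManagerId);
> > > > > >             if (worker != null) {
> > > > > >                 // TM is out of sync -> push the latest tokens again
> > > > > >                 worker.getTaskExecutorGateway()
> > > > > >                         .updateDelegationToken(latestSerializedTokens);
> > > > > >             }
> > > > > >         }
> > > > > >     }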
> > > > > >
> > > > > > This is still a PUSH based approach as the RM would again call the
> > > > > > newly introduced _updateDelegationToken_ when it encounters an
> > > > > > inconsistency (e.g. due to a temporary network partition / a race
> > > > > > condition we didn't test for / some other scenario we didn't think
> > > > > > about). In practice these inconsistencies are super hard to avoid and
> > > > > > reason about (and unfortunately yes, we see them happen from time to
> > > > > > time), so reusing the existing mechanism that is designed for this
> > > > > > exact problem simplifies things.
> > > > > >
> > > > > > To sum this up, we'd have three code paths for calling
> > > > > > _updateDelegationToken_:
> > > > > > 1) When the TM registers, we push the token (if the DTM already has
> > > > > > it) to it
> > > > > > 2) When the DTM obtains a new token, it broadcasts it to all
> > > > > > currently connected TMs
> > > > > > 3) When a TM gets out of sync, the DTM would reconcile its state
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > > > Best,
> > > > > > D.
> > > > > >
> > > > > >
> > > > > > On Wed, Jan 26, 2022 at 9:03 PM David Morávek <d...@apache.org>
> > > wrote:
> > > > > >
> > > > > > > Thanks for the update, I'll go over it tomorrow.
> > > > > > >
> > > > > > > On Wed, Jan 26, 2022 at 5:33 PM Gabor Somogyi <
> > > > > gabor.g.somo...@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Hi All,
> > > > > > >>
> > > > > > >> Since it has turned out that DTM can't be added as a member of
> > > > > > >> JobMaster
> > > > > > >> <https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176>
> > > > > > >> I've come up with a better proposal.
> > > > > > >> David, thanks for pointing this out, you've caught a bug in the
> > > > > > >> early phase!
> > > > > > >>
> > > > > > >> Namely ResourceManager
> > > > > > >> <https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124>
> > > > > > >> is a single-instance class where the DTM can be added as a member
> > > > > > >> variable.
> > > > > > >> It has a list of all already registered TMs, and new TM
> > > > > > >> registration also happens here.
> > > > > > >> To be more specific, the following can be added from a logic
> > > > > > >> perspective:
> > > > > > >> * Create a new DTM instance in ResourceManager
> > > > > > >> <https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124>
> > > > > > >> and start it (recurring thread to obtain new tokens)
> > > > > > >> * Add a new function named "updateDelegationTokens" to
> > > > > > >> TaskExecutorGateway
> > > > > > >> <https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutorGateway.java#L54>
> > > > > > >> * Call "updateDelegationTokens" on all registered TMs to propagate
> > > > > > >> new DTs (a rough sketch follows below)
> > > > > > >> * In case of new TM registration, call "updateDelegationTokens"
> > > > > > >> before registration succeeds to set up the new TM properly
> > > > > > >>
> > > > > > >> This way:
> > > > > > >> * only a single DTM would live within a cluster, which is the
> > > > > > >> expected behavior
> > > > > > >> * the DTM is going to be added to a central place where all
> > > > > > >> deployment targets can make use of it
> > > > > > >> * DTs are going to be pushed to TMs, which would generate less
> > > > > > >> network traffic than a pull based approach
> > > > > > >> (please see my previous mail where I've described both approaches)
> > > > > > >> * the HA scenario is going to be consistent because such
> > > > > > >> <https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java#L1069>
> > > > > > >> a solution can be added to "updateDelegationTokens"
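> > > > > > >>
> > > > > > >> A rough sketch of the broadcast step (assuming the DTM hands the freshly
> > > > > > >> obtained Hadoop Credentials to a callback inside ResourceManager, and that
> > > > > > >> "taskExecutors" is the map of registered TMs mentioned above):
> > > > > > >>
> > > > > > >>     import java.util.Arrays;
> > > > > > >>     import org.apache.hadoop.io.DataOutputBuffer;
> > > > > > >>     import org.apache.hadoop.security.Credentials;
> > > > > > >>
> > > > > > >>     private void onNewDelegationTokens(Credentials credentials) throws Exception {
> > > > > > >>         // serialize once, then push the same blob to every registered TM
> > > > > > >>         DataOutputBuffer dob = new DataOutputBuffer();
> > > > > > >>         credentials.writeTokenStorageToStream(dob);
> > > > > > >>         byte[] serializedTokens = Arrays.copyOf(dob.getData(), dob.getLength());
> > > > > > >>
> > > > > > >>         for (WorkerRegistration<?> worker : taskExecutors.values()) {
> > > > > > >>             worker.getTaskExecutorGateway().updateDelegationTokens(serializedTokens);
> > > > > > >>         }
> > > > > > >>     }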
> > > > > > >>
> > > > > > >> @David or all others, please share whether you agree on this or
> > > > > > >> you have a better idea/suggestion.
> > > > > > >>
> > > > > > >> BR,
> > > > > > >> G
> > > > > > >>
> > > > > > >>
> > > > > > >> On Tue, Jan 25, 2022 at 11:00 AM Gabor Somogyi <
> > > > > > gabor.g.somo...@gmail.com
> > > > > > >> >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > First of all thanks for investing your time and helping me
> > out.
> > > > As I
> > > > > > see
> > > > > > >> > you have pretty solid knowledge in the RPC area.
> > > > > > >> > I would like to rely on your knowledge since I'm learning
> this
> > > > part.
> > > > > > >> >
> > > > > > >> > > - Do we need to introduce a new RPC method or can we for
> > > example
> > > > > > >> > piggyback
> > > > > > >> > on heartbeats?
> > > > > > >> >
> > > > > > >> > I'm fine with either solution but one thing is important
> > > > > conceptually.
> > > > > > >> > There are fundamentally 2 ways tokens can be updated:
> > > > > > >> > - Push way: when there are new DTs, the JM JVM pushes them to the
> > > > > > >> > TM JVMs. This is the preferred one since only a tiny amount of
> > > > > > >> > control logic is needed.
> > > > > > >> > - Pull way: each TM polls the JM for new tokens and decides on
> > > > > > >> > its own whether the DTs need to be updated or not.
> > > > > > >> > As you've mentioned, here some ID needs to be generated, and it
> > > > > > >> > would generate quite some additional network traffic which can
> > > > > > >> > definitely be avoided.
> > > > > > >> > As a final thought, in Spark we had this kind of DT propagation
> > > > > > >> > logic and we had major issues with it.
> > > > > > >> >
> > > > > > >> > So all in all the DTM needs to obtain new tokens and there must
> > > > > > >> > be a way to send this data to all TMs from the JM.
> > > > > > >> >
> > > > > > >> > > - What delivery semantics are we looking for? (what if
> we're
> > > > only
> > > > > > >> able to
> > > > > > >> > update subset of TMs / what happens if we exhaust retries /
> > > should
> > > > > we
> > > > > > >> even
> > > > > > >> > have the retry mechanism whatsoever) - I have a feeling that
> > > > somehow
> > > > > > >> > leveraging the existing heartbeat mechanism could help to
> > answer
> > > > > these
> > > > > > >> > questions
> > > > > > >> >
> > > > > > >> > Let's go through these questions one by one.
> > > > > > >> > > What delivery semantics are we looking for?
> > > > > > >> >
> > > > > > >> > DTM must receive an exception when at least one TM was not
> > able
> > > to
> > > > > get
> > > > > > >> DTs.
> > > > > > >> >
> > > > > > >> > > what if we're only able to update subset of TMs?
> > > > > > >> >
> > > > > > >> > In such a case the DTM will reschedule the token obtainment
> > > > > > >> > after "security.kerberos.tokens.retry-wait" time.
> > > > > > >> >
> > > > > > >> > > what happens if we exhaust retries?
> > > > > > >> >
> > > > > > >> > There is no fixed number of retries. In the default configuration
> > > > > > >> > tokens need to be re-obtained after one day.
> > > > > > >> > The DTM tries to obtain new tokens after 1 day * 0.75
> > > > > > >> > (security.kerberos.tokens.renewal-ratio) = 18 hours.
> > > > > > >> > When that fails it retries after
> > > > > > >> > "security.kerberos.tokens.retry-wait", which is 1 hour by default.
> > > > > > >> > If it never succeeds then an authentication error is going to
> > > > > > >> > happen on the TM side and the workload is going to stop.
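> > > > > > >> >
> > > > > > >> > As a sketch of that scheduling rule (the scheduler, fields and method
> > > > > > >> > names are made up for illustration; the config keys are the ones from
> > > > > > >> > the FLIP):
> > > > > > >> >
> > > > > > >> >     private final ScheduledExecutorService scheduler =
> > > > > > >> >             Executors.newSingleThreadScheduledExecutor();
> > > > > > >> >     private final double renewalRatio = 0.75;       // security.kerberos.tokens.renewal-ratio
> > > > > > >> >     private final long retryWaitMillis = 3_600_000L; // security.kerberos.tokens.retry-wait (1h)
> > > > > > >> >
> > > > > > >> >     void obtainTokensAndScheduleNext() {
> > > > > > >> >         try {
> > > > > > >> >             // hypothetical DTM call returning the shortest token lifetime in ms
> > > > > > >> >             long lifetimeMillis = obtainNewTokens();
> > > > > > >> >             // e.g. 24h * 0.75 = 18h with the default Hadoop token lifetime
> > > > > > >> >             long delay = (long) (lifetimeMillis * renewalRatio);
> > > > > > >> >             scheduler.schedule(this::obtainTokensAndScheduleNext, delay, TimeUnit.MILLISECONDS);
> > > > > > >> >         } catch (Exception e) {
> > > > > > >> >             // temporary cluster issue: retry after the configured wait time
> > > > > > >> >             scheduler.schedule(
> > > > > > >> >                     this::obtainTokensAndScheduleNext, retryWaitMillis, TimeUnit.MILLISECONDS);
> > > > > > >> >         }
> > > > > > >> >     }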
> > > > > > >> >
> > > > > > >> > > should we even have the retry mechanism whatsoever?
> > > > > > >> >
> > > > > > >> > Yes, because there are always temporary cluster issues.
> > > > > > >> >
> > > > > > >> > > What does it mean for the running application (how does
> this
> > > > look
> > > > > > like
> > > > > > >> > from
> > > > > > >> > the user perspective)? As far as I remember the logs are
> only
> > > > > > collected
> > > > > > >> > ("aggregated") after the container is stopped, is that
> > correct?
> > > > > > >> >
> > > > > > >> > With default config it works like that but it can be forced
> to
> > > > > > aggregate
> > > > > > >> > at specific intervals.
> > > > > > >> > A useful feature is forcing YARN to aggregate logs while the
> > job
> > > > is
> > > > > > >> still
> > > > > > >> > running.
> > > > > > >> > For long-running jobs such as streaming jobs, this is
> > > invaluable.
> > > > To
> > > > > > do
> > > > > > >> > this,
> > > > > > >> >
> > > yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds
> > > > > must
> > > > > > >> be
> > > > > > >> > set to a non-negative value.
> > > > > > >> > When this is set, a timer will be set for the given
> duration,
> > > and
> > > > > > >> whenever
> > > > > > >> > that timer goes off,
> > > > > > >> > log aggregation will run on new files.
> > > > > > >> >
> > > > > > >> > > I think
> > > > > > >> > this topic should get its own section in the FLIP (having
> some
> > > > cross
> > > > > > >> > reference to YARN ticket would be really useful, but I'm not
> > > sure
> > > > if
> > > > > > >> there
> > > > > > >> > are any).
> > > > > > >> >
> > > > > > >> > I think this is important knowledge but this FLIP is not
> > > > > > >> > touching the already existing behavior.
> > > > > > >> > DTs are set on the AM container, which is renewed by YARN until
> > > > > > >> > that is not possible anymore.
> > > > > > >> > No new code is going to change this limitation. BTW, there is
> > > > > > >> > no jira for this.
> > > > > > >> > If you think it's worth writing this down then I think the good
> > > > > > >> > place is the official security doc area, as a caveat.
> > > > > > >> >
> > > > > > >> > > If we split the FLIP into two parts / sections that I've
> > > > > suggested,
> > > > > > I
> > > > > > >> > don't
> > > > > > >> > really think that you need to explicitly test for each
> > > deployment
> > > > > > >> scenario
> > > > > > >> > / cluster framework, because the DTM part is completely
> > > > independent
> > > > > of
> > > > > > >> the
> > > > > > >> > deployment target. Basically this is what I'm aiming for
> with
> > > > > "making
> > > > > > it
> > > > > > >> > work with the standalone" (as simple as starting a new java
> > > > process)
> > > > > > >> Flink
> > > > > > >> > first (which is also how most people deploy streaming
> > > application
> > > > on
> > > > > > k8s
> > > > > > >> > and the direction we're pushing forward with the
> auto-scaling
> > /
> > > > > > reactive
> > > > > > >> > mode initiatives).
> > > > > > >> >
> > > > > > >> > I see your point and agree with the main direction. k8s is the
> > > > > > >> > megatrend which most people will use sooner or later. Not 100%
> > > > > > >> > sure what kind of split you suggest, but in my view the main
> > > > > > >> > target is to add this feature and I'm open to any logical work
> > > > > > >> > ordering.
> > > > > > >> > Please share the specific details and we'll work it out...
> > > > > > >> >
> > > > > > >> > G
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Mon, Jan 24, 2022 at 3:04 PM David Morávek <
> > d...@apache.org>
> > > > > > wrote:
> > > > > > >> >
> > > > > > >> >> >
> > > > > > >> >> > Could you point to a code where you think it could be
> added
> > > > > > exactly?
> > > > > > >> A
> > > > > > >> >> > helping hand is welcome here 🙂
> > > > > > >> >> >
> > > > > > >> >>
> > > > > > >> >> I think you can take a look at
> > > _ResourceManagerPartitionTracker_
> > > > > [1]
> > > > > > >> which
> > > > > > >> >> seems to have somewhat similar properties to the DTM.
> > > > > > >> >>
> > > > > > >> >> One topic that needs to be addressed there is how the RPC
> > with
> > > > the
> > > > > > >> >> _TaskExecutorGateway_ should look like.
> > > > > > >> >> - Do we need to introduce a new RPC method or can we for
> > > example
> > > > > > >> piggyback
> > > > > > >> >> on heartbeats?
> > > > > > >> >> - What delivery semantics are we looking for? (what if
> we're
> > > only
> > > > > > able
> > > > > > >> to
> > > > > > >> >> update subset of TMs / what happens if we exhaust retries /
> > > > should
> > > > > we
> > > > > > >> even
> > > > > > >> >> have the retry mechanism whatsoever) - I have a feeling
> that
> > > > > somehow
> > > > > > >> >> leveraging the existing heartbeat mechanism could help to
> > > answer
> > > > > > these
> > > > > > >> >> questions
> > > > > > >> >>
> > > > > > >> >> In short, after DT reaches its max lifetime then log
> > > > > > >> >> aggregation stops
> > > > > > >> >> >
> > > > > > >> >>
> > > > > > >> >> What does it mean for the running application (how does
> this
> > > look
> > > > > > like
> > > > > > >> >> from
> > > > > > >> >> the user perspective)? As far as I remember the logs are
> only
> > > > > > collected
> > > > > > >> >> ("aggregated") after the container is stopped, is that
> > > correct? I
> > > > > > think
> > > > > > >> >> this topic should get its own section in the FLIP (having
> > some
> > > > > cross
> > > > > > >> >> reference to YARN ticket would be really useful, but I'm
> not
> > > sure
> > > > > if
> > > > > > >> there
> > > > > > >> >> are any).
> > > > > > >> >>
> > > > > > >> >> All deployment modes (per-job, per-app, ...) are planned to
> > be
> > > > > tested
> > > > > > >> and
> > > > > > >> >> > expect to work with the initial implementation however
> not
> > > all
> > > > > > >> >> deployment
> > > > > > >> >> > targets (k8s, local, ...
> > > > > > >> >> >
> > > > > > >> >>
> > > > > > >> >> If we split the FLIP into two parts / sections that I've
> > > > > suggested, I
> > > > > > >> >> don't
> > > > > > >> >> really think that you need to explicitly test for each
> > > deployment
> > > > > > >> scenario
> > > > > > >> >> / cluster framework, because the DTM part is completely
> > > > independent
> > > > > > of
> > > > > > >> the
> > > > > > >> >> deployment target. Basically this is what I'm aiming for
> with
> > > > > "making
> > > > > > >> it
> > > > > > >> >> work with the standalone" (as simple as starting a new java
> > > > > process)
> > > > > > >> Flink
> > > > > > >> >> first (which is also how most people deploy streaming
> > > application
> > > > > on
> > > > > > >> k8s
> > > > > > >> >> and the direction we're pushing forward with the
> > auto-scaling /
> > > > > > >> reactive
> > > > > > >> >> mode initiatives).
> > > > > > >> >>
> > > > > > >> >> The whole integration with YARN (let's forget about log
> > > > aggregation
> > > > > > >> for a
> > > > > > >> >> moment) / k8s-native only boils down to how do we make the
> > > keytab
> > > > > > file
> > > > > > >> >> local to the JobManager so the DTM can read it, so it's
> > > basically
> > > > > > >> built on
> > > > > > >> >> top of that. The only special thing that needs to be tested
> > > there
> > > > > is
> > > > > > >> the
> > > > > > >> >> "keytab distribution" code path.
> > > > > > >> >>
> > > > > > >> >> [1]
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/flink/blob/release-1.14.3/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResourceManagerPartitionTracker.java
> > > > > > >> >>
> > > > > > >> >> Best,
> > > > > > >> >> D.
> > > > > > >> >>
> > > > > > >> >> On Mon, Jan 24, 2022 at 12:35 PM Gabor Somogyi <
> > > > > > >> gabor.g.somo...@gmail.com
> > > > > > >> >> >
> > > > > > >> >> wrote:
> > > > > > >> >>
> > > > > > >> >> > > There is a separate JobMaster for each job
> > > > > > >> >> > within a Flink cluster and each JobMaster only has a
> > partial
> > > > view
> > > > > > of
> > > > > > >> the
> > > > > > >> >> > task managers
> > > > > > >> >> >
> > > > > > >> >> > Good point! I've had a deeper look and you're right. We
> > > > > definitely
> > > > > > >> need
> > > > > > >> >> to
> > > > > > >> >> > find another place.
> > > > > > >> >> >
> > > > > > >> >> > > Related per-cluster or per-job keytab:
> > > > > > >> >> >
> > > > > > >> >> > In the current code a per-cluster keytab is implemented and I
> > > > > > >> >> > intend to keep it like this within this FLIP. The reason is
> > > > > > >> >> > simple: tokens on the TM side can be stored within the
> > > > > > >> >> > UserGroupInformation (UGI) structure, which is global. I'm not
> > > > > > >> >> > saying it's impossible to change that, but I think this is a
> > > > > > >> >> > complexity which the initial implementation is not required to
> > > > > > >> >> > contain. Additionally we've not seen such a need from the user
> > > > > > >> >> > side. If the need arises later on then another FLIP on this
> > > > > > >> >> > topic can be created and discussed. Proper multi-UGI handling
> > > > > > >> >> > within a single JVM is a topic where several rounds of
> > > > > > >> >> > deep-dives with the Hadoop/YARN guys are required.
> > > > > > >> >> >
> > > > > > >> >> > > single DTM instance embedded with
> > > > > > >> >> > the ResourceManager (the Flink component)
> > > > > > >> >> >
> > > > > > >> >> > Could you point to a code where you think it could be
> added
> > > > > > exactly?
> > > > > > >> A
> > > > > > >> >> > helping hand is welcome here🙂
> > > > > > >> >> >
> > > > > > >> >> > > Then the single (initial) implementation should work
> with
> > > all
> > > > > the
> > > > > > >> >> > deployments modes out of the box (which is not what the
> > FLIP
> > > > > > >> suggests).
> > > > > > >> >> Is
> > > > > > >> >> > that correct?
> > > > > > >> >> >
> > > > > > >> >> > All deployment modes (per-job, per-app, ...) are planned to
> > > > > > >> >> > be tested and are expected to work with the initial
> > > > > > >> >> > implementation, however not all deployment targets (k8s,
> > > > > > >> >> > local, ...) are intended to be tested. Per deployment target a
> > > > > > >> >> > new jira needs to be created, where I expect a small amount of
> > > > > > >> >> > code needs to be added and a relatively expensive testing
> > > > > > >> >> > effort is required.
> > > > > > >> >> >
> > > > > > >> >> > > I've taken a look into the prototype and in the
> > > > > > >> >> "YarnClusterDescriptor"
> > > > > > >> >> > you're injecting a delegation token into the AM [1]
> (that's
> > > > > > obtained
> > > > > > >> >> using
> > > > > > >> >> > the provided keytab). If I understand this correctly from
> > > > > previous
> > > > > > >> >> > discussion / FLIP, this is to support log aggregation and
> > DT
> > > > has
> > > > > a
> > > > > > >> >> limited
> > > > > > >> >> > validity. How is this DT going to be renewed?
> > > > > > >> >> >
> > > > > > >> >> > You're clever and touched a limitation which Spark has too. In
> > > > > > >> >> > short, after the DT reaches its max lifetime, log aggregation
> > > > > > >> >> > stops. I've had several deep-dive rounds with the YARN guys
> > > > > > >> >> > during my Spark years because I wanted to fill this gap. They
> > > > > > >> >> > couldn't provide us any way to re-inject the newly obtained DT,
> > > > > > >> >> > so in the end I gave up on this.
> > > > > > >> >> >
> > > > > > >> >> > BR,
> > > > > > >> >> > G
> > > > > > >> >> >
> > > > > > >> >> >
> > > > > > >> >> > On Mon, 24 Jan 2022, 11:00 David Morávek, <
> d...@apache.org
> > >
> > > > > wrote:
> > > > > > >> >> >
> > > > > > >> >> > > Hi Gabor,
> > > > > > >> >> > >
> > > > > > >> >> > > There is actually a huge difference between JobManager
> > > > > (process)
> > > > > > >> and
> > > > > > >> >> > > JobMaster (job coordinator). The naming is
> unfortunately
> > > bit
> > > > > > >> >> misleading
> > > > > > >> >> > > here from historical reasons. There is a separate
> > JobMaster
> > > > for
> > > > > > >> each
> > > > > > >> >> job
> > > > > > >> >> > > within a Flink cluster and each JobMaster only has a
> > > partial
> > > > > view
> > > > > > >> of
> > > > > > >> >> the
> > > > > > >> >> > > task managers (depends on where the slots for a
> > particular
> > > > job
> > > > > > are
> > > > > > >> >> > > allocated). This means that you'll end up with N
> > > > > > >> >> > "DelegationTokenManagers"
> > > > > > >> >> > > competing with each other (N = number of running jobs
> in
> > > the
> > > > > > >> cluster).
> > > > > > >> >> > >
> > > > > > >> >> > > This makes me think we're mixing two abstraction levels
> > > here:
> > > > > > >> >> > >
> > > > > > >> >> > > a) Per-cluster delegation tokens
> > > > > > >> >> > > - Simpler approach, it would involve a single DTM
> > instance
> > > > > > embedded
> > > > > > >> >> with
> > > > > > >> >> > > the ResourceManager (the Flink component)
> > > > > > >> >> > > b) Per-job delegation tokens
> > > > > > >> >> > > - More complex approach, but could be more flexible
> from
> > > the
> > > > > user
> > > > > > >> >> side of
> > > > > > >> >> > > things.
> > > > > > >> >> > > - Multiple DTM instances, that are bound with the
> > JobMaster
> > > > > > >> lifecycle.
> > > > > > >> >> > > Delegation tokens are attached with a particular slots
> > that
> > > > are
> > > > > > >> >> executing
> > > > > > >> >> > > the job tasks instead of the whole task manager (TM
> could
> > > be
> > > > > > >> executing
> > > > > > >> >> > > multiple jobs with different tokens).
> > > > > > >> >> > > - The question is which keytab should be used for the
> > > > > clustering
> > > > > > >> >> > framework,
> > > > > > >> >> > > to support log aggregation on YARN (an extra keytab,
> > keytab
> > > > > that
> > > > > > >> comes
> > > > > > >> >> > with
> > > > > > >> >> > > the first job?)
> > > > > > >> >> > >
> > > > > > >> >> > > I think these are the things that need to be clarified
> in
> > > the
> > > > > > FLIP
> > > > > > >> >> before
> > > > > > >> >> > > proceeding.
> > > > > > >> >> > >
> > > > > > >> >> > > A follow-up question for getting a better understanding
> > > where
> > > > > > this
> > > > > > >> >> should
> > > > > > >> >> > > be headed: Are there any use cases where user may want
> to
> > > use
> > > > > > >> >> different
> > > > > > >> >> > > keytabs with each job, or are we fine with using a
> > > > cluster-wide
> > > > > > >> >> keytab?
> > > > > > >> >> > If
> > > > > > >> >> > > we go with per-cluster keytabs, is it OK that all jobs
> > > > > submitted
> > > > > > >> into
> > > > > > >> >> > this
> > > > > > >> >> > > cluster can access it (even the future ones)? Should
> this
> > > be
> > > > a
> > > > > > >> >> security
> > > > > > >> >> > > concern?
> > > > > > >> >> > >
> > > > > > >> >> > > Presume you though I would implement a new class with
> > > > > JobManager
> > > > > > >> name.
> > > > > > >> >> > The
> > > > > > >> >> > > > plan is not that.
> > > > > > >> >> > > >
> > > > > > >> >> > >
> > > > > > >> >> > > I've never suggested such thing.
> > > > > > >> >> > >
> > > > > > >> >> > >
> > > > > > >> >> > > > No. That said earlier DT handling is planned to be
> done
> > > > > > >> completely
> > > > > > >> >> in
> > > > > > >> >> > > > Flink. DTM has a renewal thread which re-obtains
> tokens
> > > in
> > > > > the
> > > > > > >> >> proper
> > > > > > >> >> > > time
> > > > > > >> >> > > > when needed.
> > > > > > >> >> > > >
> > > > > > >> >> > >
> > > > > > >> >> > > Then the single (initial) implementation should work
> with
> > > all
> > > > > the
> > > > > > >> >> > > deployments modes out of the box (which is not what the
> > > FLIP
> > > > > > >> >> suggests).
> > > > > > >> >> > Is
> > > > > > >> >> > > that correct?
> > > > > > >> >> > >
> > > > > > >> >> > > If the cluster framework, also requires delegation
> token
> > > for
> > > > > > their
> > > > > > >> >> inner
> > > > > > >> >> > > working (this is IMO only applies to YARN), it might
> need
> > > an
> > > > > > extra
> > > > > > >> >> step
> > > > > > >> >> > > (injecting the token into application master
> container).
> > > > > > >> >> > >
> > > > > > >> >> > > Separating the individual layers (actual Flink cluster
> -
> > > > > > basically
> > > > > > >> >> making
> > > > > > >> >> > > this work with a standalone deployment  / "cluster
> > > > framework" -
> > > > > > >> >> support
> > > > > > >> >> > for
> > > > > > >> >> > > YARN log aggregation) in the FLIP would be useful.
> > > > > > >> >> > >
> > > > > > >> >> > > Reading the linked Spark readme could be useful.
> > > > > > >> >> > > >
> > > > > > >> >> > >
> > > > > > >> >> > > I've read that, but please be patient with the
> questions,
> > > > > > Kerberos
> > > > > > >> is
> > > > > > >> >> not
> > > > > > >> >> > > an easy topic to get into and I've had a very little
> > > contact
> > > > > with
> > > > > > >> it
> > > > > > >> >> in
> > > > > > >> >> > the
> > > > > > >> >> > > past.
> > > > > > >> >> > >
> > > > > > >> >> > >
> > > > > > >> >> > >
> > > > > > >> >> >
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
> > > > > > >> >> > > >
> > > > > > >> >> > >
> > > > > > >> >> > > I've taken a look into the prototype and in the
> > > > > > >> >> "YarnClusterDescriptor"
> > > > > > >> >> > > you're injecting a delegation token into the AM [1]
> > (that's
> > > > > > >> obtained
> > > > > > >> >> > using
> > > > > > >> >> > > the provided keytab). If I understand this correctly
> from
> > > > > > previous
> > > > > > >> >> > > discussion / FLIP, this is to support log aggregation
> and
> > > DT
> > > > > has
> > > > > > a
> > > > > > >> >> > limited
> > > > > > >> >> > > validity. How is this DT going to be renewed?
> > > > > > >> >> > >
> > > > > > >> >> > > [1]
> > > > > > >> >> > >
> > > > > > >> >> > >
> > > > > > >> >> >
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/gaborgsomogyi/flink/commit/8ab75e46013f159778ccfce52463e7bc63e395a9#diff-02416e2d6ca99e1456f9c3949f3d7c2ac523d3fe25378620c09632e4aac34e4eR1261
> > > > > > >> >> > >
> > > > > > >> >> > > Best,
> > > > > > >> >> > > D.
> > > > > > >> >> > >
> > > > > > >> >> > > On Fri, Jan 21, 2022 at 9:35 PM Gabor Somogyi <
> > > > > > >> >> gabor.g.somo...@gmail.com
> > > > > > >> >> > >
> > > > > > >> >> > > wrote:
> > > > > > >> >> > >
> > > > > > >> >> > > > Here is the exact class, I'm from mobile so not had a
> > > look
> > > > at
> > > > > > the
> > > > > > >> >> exact
> > > > > > >> >> > > > class name:
> > > > > > >> >> > > >
> > > > > > >> >> > > >
> > > > > > >> >> > >
> > > > > > >> >> >
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
> > > > > > >> >> > > > That keeps track of TMs where the tokens can be sent
> > to.
> > > > > > >> >> > > >
> > > > > > >> >> > > > > My feeling would be that we shouldn't really
> > introduce
> > > a
> > > > > new
> > > > > > >> >> > component
> > > > > > >> >> > > > with
> > > > > > >> >> > > > a custom lifecycle, but rather we should try to
> > > incorporate
> > > > > > this
> > > > > > >> >> into
> > > > > > >> >> > > > existing ones.
> > > > > > >> >> > > >
> > > > > > >> >> > > > Can you be more specific? Presume you though I would
> > > > > implement
> > > > > > a
> > > > > > >> new
> > > > > > >> >> > > class
> > > > > > >> >> > > > with JobManager name. The plan is not that.
> > > > > > >> >> > > >
> > > > > > >> >> > > > > If I understand this correctly, this means that we
> > then
> > > > > push
> > > > > > >> the
> > > > > > >> >> > token
> > > > > > >> >> > > > renewal logic to YARN.
> > > > > > >> >> > > >
> > > > > > >> >> > > > No. That said earlier DT handling is planned to be
> done
> > > > > > >> completely
> > > > > > >> >> in
> > > > > > >> >> > > > Flink. DTM has a renewal thread which re-obtains
> tokens
> > > in
> > > > > the
> > > > > > >> >> proper
> > > > > > >> >> > > time
> > > > > > >> >> > > > when needed. YARN log aggregation is a totally
> > different
> > > > > > feature,
> > > > > > >> >> where
> > > > > > >> >> > > > YARN does the renewal. Log aggregation was an example
> > why
> > > > the
> > > > > > >> code
> > > > > > >> >> > can't
> > > > > > >> >> > > be
> > > > > > >> >> > > > 100% reusable for all resource managers. Reading the
> > > linked
> > > > > > Spark
> > > > > > >> >> > readme
> > > > > > >> >> > > > could be useful.
> > > > > > >> >> > > >
> > > > > > >> >> > > > G
> > > > > > >> >> > > >
> > > > > > >> >> > > > On Fri, 21 Jan 2022, 21:05 David Morávek, <
> > > d...@apache.org
> > > > >
> > > > > > >> wrote:
> > > > > > >> >> > > >
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > > JobManager is the Flink class.
> > > > > > >> >> > > > >
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > There is no such class in Flink. The closest thing
> to
> > > the
> > > > > > >> >> JobManager
> > > > > > >> >> > > is a
> > > > > > >> >> > > > > ClusterEntrypoint. The cluster entrypoint spawns
> new
> > RM
> > > > > > Runner
> > > > > > >> &
> > > > > > >> >> > > > Dispatcher
> > > > > > >> >> > > > > Runner that start participating in the leader
> > election.
> > > > > Once
> > > > > > >> they
> > > > > > >> >> > gain
> > > > > > >> >> > > > > leadership they spawn the actual underlying
> instances
> > > of
> > > > > > these
> > > > > > >> two
> > > > > > >> >> > > "main
> > > > > > >> >> > > > > components".
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > My feeling would be that we shouldn't really
> > introduce
> > > a
> > > > > new
> > > > > > >> >> > component
> > > > > > >> >> > > > with
> > > > > > >> >> > > > > a custom lifecycle, but rather we should try to
> > > > incorporate
> > > > > > >> this
> > > > > > >> >> into
> > > > > > >> >> > > > > existing ones.
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > My biggest concerns would be:
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > - How would the lifecycle of the new component look
> > > like
> > > > > with
> > > > > > >> >> regards
> > > > > > >> >> > > to
> > > > > > >> >> > > > HA
> > > > > > >> >> > > > > setups. If we really try to decide to introduce a
> > > > > completely
> > > > > > >> new
> > > > > > >> >> > > > component,
> > > > > > >> >> > > > > how should this work in case of multiple JobManager
> > > > > > instances?
> > > > > > >> >> > > > > - Which components does it talk to / how? For
> example
> > > how
> > > > > > does
> > > > > > >> the
> > > > > > >> >> > > > > broadcast of new token to task managers
> > > > > (TaskManagerGateway)
> > > > > > >> look
> > > > > > >> >> > like?
> > > > > > >> >> > > > Do
> > > > > > >> >> > > > > we simply introduce a new RPC on the
> > > > ResourceManagerGateway
> > > > > > >> that
> > > > > > >> >> > > > broadcasts
> > > > > > >> >> > > > > it or does the new component need to do some kind
> of
> > > > > > >> bookkeeping
> > > > > > >> >> of
> > > > > > >> >> > > task
> > > > > > >> >> > > > > managers that it needs to notify?
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > YARN based HDFS log aggregation would not work by
> > > > > > >> >> > > > > dropping that code. Just to be crystal clear, the actual
> > > > > > >> >> > > > > implementation contains this for exactly this reason.
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > This is the missing part +1. If I understand this
> > > > > correctly,
> > > > > > >> this
> > > > > > >> >> > means
> > > > > > >> >> > > > > that we then push the token renewal logic to YARN.
> > How
> > > do
> > > > > you
> > > > > > >> >> plan to
> > > > > > >> >> > > > > implement the renewal logic on k8s?
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > D.
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > On Fri, Jan 21, 2022 at 8:37 PM Gabor Somogyi <
> > > > > > >> >> > > gabor.g.somo...@gmail.com
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > wrote:
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > > > I think we might both mean something different
> by
> > > the
> > > > > RM.
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > > You feel it well, I've not specified these terms
> > well
> > > > in
> > > > > > the
> > > > > > >> >> > > > explanation.
> > > > > > >> >> > > > > > RM I meant resource management framework.
> > JobManager
> > > is
> > > > > the
> > > > > > >> >> Flink
> > > > > > >> >> > > > class.
> > > > > > >> >> > > > > > This means that inside JM instance there will be
> a
> > > DTM
> > > > > > >> >> instance, so
> > > > > > >> >> > > > they
> > > > > > >> >> > > > > > would have the same lifecycle. Hope I've answered
> > the
> > > > > > >> question.
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > > > If we have tokens available on the client side,
> > why
> > > > do
> > > > > we
> > > > > > >> >> need to
> > > > > > >> >> > > set
> > > > > > >> >> > > > > > them
> > > > > > >> >> > > > > > into the AM (yarn specific concept) launch
> context?
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > > YARN based HDFS log aggregation would not work by
> > > > > > >> >> > > > > > dropping that code. Just to be crystal clear, the actual
> > > > > > >> >> > > > > > implementation contains this for exactly this reason.
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > > G
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > > On Fri, 21 Jan 2022, 20:12 David Morávek, <
> > > > > d...@apache.org
> > > > > > >
> > > > > > >> >> wrote:
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > > > Hi Gabor,
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > > > 1. One thing is important, token management is
> > > > planned
> > > > > to
> > > > > > >> be
> > > > > > >> >> done
> > > > > > >> >> > > > > > > > generically within Flink and not scattered in
> > RM
> > > > > > specific
> > > > > > >> >> code.
> > > > > > >> >> > > > > > > JobManager
> > > > > > >> >> > > > > > > > has a DelegationTokenManager which obtains
> > tokens
> > > > > > >> >> time-to-time
> > > > > > >> >> > > (if
> > > > > > >> >> > > > > > > > configured properly). JM knows which
> > TaskManagers
> > > > are
> > > > > > in
> > > > > > >> >> place
> > > > > > >> >> > so
> > > > > > >> >> > > > it
> > > > > > >> >> > > > > > can
> > > > > > >> >> > > > > > > > distribute it to all TMs. That's it
> basically.
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > > > I think we might both mean something different
> by
> > > the
> > > > > RM.
> > > > > > >> >> > > JobManager
> > > > > > >> >> > > > is
> > > > > > >> >> > > > > > > basically just a process encapsulating multiple
> > > > > > components,
> > > > > > >> >> one
> > > > > > >> >> > of
> > > > > > >> >> > > > > which
> > > > > > >> >> > > > > > is
> > > > > > >> >> > > > > > > a ResourceManager, which is the component that
> > > > manages
> > > > > > task
> > > > > > >> >> > manager
> > > > > > >> >> > > > > > > registrations [1]. There is more or less a
> single
> > > > > > >> >> implementation
> > > > > > >> >> > of
> > > > > > >> >> > > > the
> > > > > > >> >> > > > > > RM
> > > > > > >> >> > > > > > > with plugable drivers for the active
> integrations
> > > > > (yarn,
> > > > > > >> k8s).
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > > > It would be great if you could share more
> details
> > > of
> > > > > how
> > > > > > >> >> exactly
> > > > > > >> >> > > the
> > > > > > >> >> > > > > DTM
> > > > > > >> >> > > > > > is
> > > > > > >> >> > > > > > > going to fit in the current JM architecture.
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > > > 2. 99.9% of the code is generic but each RM
> > handles
> > > > > > tokens
> > > > > > >> >> > > > > differently. A
> > > > > > >> >> > > > > > > > good example is YARN obtains tokens on client
> > > side
> > > > > and
> > > > > > >> then
> > > > > > >> >> > sets
> > > > > > >> >> > > > them
> > > > > > >> >> > > > > > on
> > > > > > >> >> > > > > > > > the newly created AM container launch
> context.
> > > This
> > > > > is
> > > > > > >> >> purely
> > > > > > >> >> > > YARN
> > > > > > >> >> > > > > > > specific
> > > > > > >> >> > > > > > > > and cant't be spared. With my actual plans
> > > > standalone
> > > > > > >> can be
> > > > > > >> >> > > > changed
> > > > > > >> >> > > > > to
> > > > > > >> >> > > > > > > use
> > > > > > >> >> > > > > > > > the framework. By using it I mean no RM
> > specific
> > > > DTM
> > > > > or
> > > > > > >> >> > > whatsoever
> > > > > > >> >> > > > is
> > > > > > >> >> > > > > > > > needed.
> > > > > > >> >> > > > > > > >
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > > > If we have tokens available on the client side,
> > why
> > > > do
> > > > > we
> > > > > > >> >> need to
> > > > > > >> >> > > set
> > > > > > >> >> > > > > > them
> > > > > > >> >> > > > > > > into the AM (yarn specific concept) launch
> > context?
> > > > Why
> > > > > > >> can't
> > > > > > >> >> we
> > > > > > >> >> > > > simply
> > > > > > >> >> > > > > > > send them to the JM, eg. as a parameter of the
> > job
> > > > > > >> submission
> > > > > > >> >> /
> > > > > > >> >> > via
> > > > > > >> >> > > > > > > separate RPC call? There might be something I'm
> > > > missing
> > > > > > >> due to
> > > > > > >> >> > > > limited
> > > > > > >> >> > > > > > > knowledge, but handling the token on the
> "cluster
> > > > > > >> framework"
> > > > > > >> >> > level
> > > > > > >> >> > > > > > doesn't
> > > > > > >> >> > > > > > > seem necessary.
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > > > [1]
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > >
> > > > > > >> >> > > >
> > > > > > >> >> > >
> > > > > > >> >> >
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/#jobmanager
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > > > Best,
> > > > > > >> >> > > > > > > D.
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > > > On Fri, Jan 21, 2022 at 7:48 PM Gabor Somogyi <
> > > > > > >> >> > > > > gabor.g.somo...@gmail.com
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > > > wrote:
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > > > > Oh and one more thing. I'm planning to add this
> > > > > > >> >> > > > > > > > feature in small chunks of PRs because security is a
> > > > > > >> >> > > > > > > > super hairy area. That way reviewers can more easily
> > > > > > >> >> > > > > > > > grasp the concept.
> > > > > > >> >> > > > > > > > easily obtains the concept.
> > > > > > >> >> > > > > > > >
> > > > > > >> >> > > > > > > > On Fri, 21 Jan 2022, 18:03 David Morávek, <
> > > > > > >> d...@apache.org>
> > > > > > >> >> > > wrote:
> > > > > > >> >> > > > > > > >
> > > > > > >> >> > > > > > > > > Hi Gabor,
> > > > > > >> >> > > > > > > > >
> > > > > > >> >> > > > > > > > > thanks for drafting the FLIP, I think
> having
> > a
> > > > > solid
> > > > > > >> >> Kerberos
> > > > > > >> >> > > > > support
> > > > > > >> >> > > > > > > is
> > > > > > >> >> > > > > > > > > crucial for many enterprise deployments.
> > > > > > >> >> > > > > > > > >
> > > > > > >> >> > > > > > > > > I have multiple questions regarding the
> > > > > > implementation
> > > > > > >> >> (note
> > > > > > >> >> > > > that I
> > > > > > >> >> > > > > > > have
> > > > > > >> >> > > > > > > > > very limited knowledge of Kerberos):
> > > > > > >> >> > > > > > > > >
> > > > > > >> >> > > > > > > > > 1) If I understand it correctly, we'll only
> > > > obtain
> > > > > > >> tokens
> > > > > > >> >> in
> > > > > > >> >> > > the
> > > > > > >> >> > > > > job
> > > > > > >> >> > > > > > > > > manager and then we'll distribute them via
> > RPC
> > > > > (needs
> > > > > > >> to
> > > > > > >> >> be
> > > > > > >> >> > > > > secured).
> > > > > > >> >> > > > > > > > >
> > > > > > >> >> > > > > > > > > Can you please outline how the
> communication
> > > will
> > > > > > look
> > > > > > >> >> like?
> > > > > > >> >> > Is
> > > > > > >> >> > > > the
> > > > > > >> >> > > > > > > > > DelegationTokenManager going to be a part
> of
> > > the
> > > > > > >> >> > > ResourceManager?
> > > > > > >> >> > > > > Can
> > > > > > >> >> > > > > > > you
> > > > > > >> >> > > > > > > > > outline it's lifecycle / how it's going to
> be
> > > > > > >> integrated
> > > > > > >> >> > there?
> > > > > > >> >> > > > > > > > >
> > > > > > >> >> > > > > > > > > 2) Do we really need a YARN / k8s specific
> > > > > > >> >> implementations?
> > > > > > >> >> > Is
> > > > > > >> >> > > it
> > > > > > >> >> > > > > > > > possible
> > > > > > >> >> > > > > > > > > to obtain / renew a token in a generic way?
> > > Maybe
> > > > > to
> > > > > > >> >> rephrase
> > > > > > >> >> > > > that,
> > > > > > >> >> > > > > > is
> > > > > > >> >> > > > > > > it
> > > > > > >> >> > > > > > > > > possible to implement
> DelegationTokenManager
> > > for
> > > > > the
> > > > > > >> >> > standalone
> > > > > > >> >> > > > > > Flink?
> > > > > > >> >> > > > > > > If
> > > > > > >> >> > > > > > > > > we're able to solve this point, it could be
> > > > > possible
> > > > > > to
> > > > > > >> >> > target
> > > > > > >> >> > > > all
> > > > > > >> >> > > > > > > > > deployment scenarios with a single
> > > > implementation.
> > > > > > >> >> > > > > > > > >
> > > > > > >> >> > > > > > > > > Best,
> > > > > > >> >> > > > > > > > > D.
> > > > > > >> >> > > > > > > > >
> > > > > > >> >> > > > > > > > > On Fri, Jan 14, 2022 at 3:47 AM Junfan
> Zhang
> > <
> > > > > > >> >> > > > > > zuston.sha...@gmail.com>
> > > > > > >> >> > > > > > > > > wrote:
> > > > > > >> >> > > > > > > > >
> > > > > > >> >> > > > > > > > > > Hi G
> > > > > > >> >> > > > > > > > > >
> > > > > > >> >> > > > > > > > > > Thanks for your explain in detail. I have
> > > > gotten
> > > > > > your
> > > > > > >> >> > > thoughts,
> > > > > > >> >> > > > > and
> > > > > > >> >> > > > > > > any
> > > > > > >> >> > > > > > > > > > way this proposal
> > > > > > >> >> > > > > > > > > > is a great improvement.
> > > > > > >> >> > > > > > > > > >
> > > > > > >> >> > > > > > > > > > Looking forward to your implementation
> and
> > i
> > > > will
> > > > > > >> keep
> > > > > > >> >> > focus
> > > > > > >> >> > > on
> > > > > > >> >> > > > > it.
> > > > > > >> >> > > > > > > > > > Thanks again.
> > > > > > >> >> > > > > > > > > >
> > > > > > >> >> > > > > > > > > > Best
> > > > > > >> >> > > > > > > > > > JunFan.
> > > > > > >> >> > > > > > > > > > On Jan 13, 2022, 9:20 PM +0800, Gabor
> > > Somogyi <
> > > > > > >> >> > > > > > > > gabor.g.somo...@gmail.com
> > > > > > >> >> > > > > > > > > >,
> > > > > > >> >> > > > > > > > > > wrote:
> > > > > > >> >> > > > > > > > > > > Just to confirm keeping
> > > > > > >> >> > > > > > "security.kerberos.fetch.delegation-token"
> > > > > > >> >> > > > > > > is
> > > > > > >> >> > > > > > > > > > added
> > > > > > >> >> > > > > > > > > > > to the doc.
> > > > > > >> >> > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > BR,
> > > > > > >> >> > > > > > > > > > > G
> > > > > > >> >> > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > On Thu, Jan 13, 2022 at 1:34 PM Gabor
> > > > Somogyi <
> > > > > > >> >> > > > > > > > > gabor.g.somo...@gmail.com
> > > > > > >> >> > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > wrote:
> > > > > > >> >> > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > Hi JunFan,
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > > By the way, maybe this should be
> > added
> > > in
> > > > > the
> > > > > > >> >> > migration
> > > > > > >> >> > > > > plan
> > > > > > >> >> > > > > > or
> > > > > > >> >> > > > > > > > > > > > intergation section in the FLIP-211.
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > Going to add this soon.
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > > Besides, I have a question that the
> > KDC
> > > > > will
> > > > > > >> >> collapse
> > > > > > >> >> > > > when
> > > > > > >> >> > > > > > the
> > > > > > >> >> > > > > > > > > > cluster
> > > > > > >> >> > > > > > > > > > > > reached 200 nodes you described
> > > > > > >> >> > > > > > > > > > > > in the google doc. Do you have any
> > > > attachment
> > > > > > or
> > > > > > >> >> > > reference
> > > > > > >> >> > > > to
> > > > > > >> >> > > > > > > prove
> > > > > > >> >> > > > > > > > > it?
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > "KDC *may* collapse under some
> > > > circumstances"
> > > > > > is
> > > > > > >> the
> > > > > > >> >> > > proper
> > > > > > >> >> > > > > > > > wording.
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > We have several customers who are
> > > executing
> > > > > > >> >> workloads
> > > > > > >> >> > on
> > > > > > >> >> > > > > > > > Spark/Flink.
> > > > > > >> >> > > > > > > > > > Most
> > > > > > >> >> > > > > > > > > > > > of the time I'm facing their
> > > > > > >> >> > > > > > > > > > > > daily issues which is heavily
> > environment
> > > > and
> > > > > > >> >> use-case
> > > > > > >> >> > > > > > dependent.
> > > > > > >> >> > > > > > > > > I've
> > > > > > >> >> > > > > > > > > > > > seen various cases:
> > > > > > >> >> > > > > > > > > > > > * where the mentioned ~1k nodes were
> > > > working
> > > > > > fine
> > > > > > >> >> > > > > > > > > > > > * where KDC thought the number of
> > > requests
> > > > > are
> > > > > > >> >> coming
> > > > > > >> >> > > from
> > > > > > >> >> > > > > DDOS
> > > > > > >> >> > > > > > > > > attack
> > > > > > >> >> > > > > > > > > > so
> > > > > > >> >> > > > > > > > > > > > discontinued authentication
> > > > > > >> >> > > > > > > > > > > > * where KDC was simply not responding
> > > > because
> > > > > > of
> > > > > > >> the
> > > > > > >> >> > load
> > > > > > >> >> > > > > > > > > > > > * where KDC was intermittently had
> some
> > > > > outage
> > > > > > >> (this
> > > > > > >> >> > was
> > > > > > >> >> > > > the
> > > > > > >> >> > > > > > most
> > > > > > >> >> > > > > > > > > nasty
> > > > > > >> >> > > > > > > > > > > > thing)
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > Since you're managing relatively big
> > > > cluster
> > > > > > then
> > > > > > >> >> you
> > > > > > >> >> > > know
> > > > > > >> >> > > > > that
> > > > > > >> >> > > > > > > KDC
> > > > > > >> >> > > > > > > > > is
> > > > > > >> >> > > > > > > > > > not
> > > > > > >> >> > > > > > > > > > > > only used by Spark/Flink workloads
> > > > > > >> >> > > > > > > > > > > > but the whole company IT
> infrastructure
> > > is
> > > > > > >> bombing
> > > > > > >> >> it
> > > > > > >> >> > so
> > > > > > >> >> > > it
> > > > > > >> >> > > > > > > really
> > > > > > >> >> > > > > > > > > > depends
> > > > > > >> >> > > > > > > > > > > > on other factors too whether KDC is
> > > > reaching
> > > > > > >> >> > > > > > > > > > > > it's limit or not. Not sure what kind
> > of
> > > > > > evidence
> > > > > > >> >> are
> > > > > > >> >> > you
> > > > > > >> >> > > > > > looking
> > > > > > >> >> > > > > > > > for
> > > > > > >> >> > > > > > > > > > but
> > > > > > >> >> > > > > > > > > > > > I'm not authorized to share any
> > > information
> > > > > > about
> > > > > > >> >> > > > > > > > > > > > our clients data.
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > One thing is for sure. The more
> > external
> > > > > system
> > > > > > >> >> types
> > > > > > >> >> > are
> > > > > > >> >> > > > > used
> > > > > > >> >> > > > > > in
> > > > > > >> >> > > > > > > > > > > > workloads (for ex. HDFS, HBase, Hive,
> > > > Kafka)
> > > > > > >> which
> > > > > > >> >> > > > > > > > > > > > are authenticating through KDC the
> more
> > > > > > >> possibility
> > > > > > >> >> to
> > > > > > >> >> > > > reach
> > > > > > >> >> > > > > > this
> > > > > > >> >> > > > > > > > > > > > threshold when the cluster is big
> > enough.
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > All in all this feature is here to
> help
> > > all
> > > > > > users
> > > > > > >> >> never
> > > > > > >> >> > > > reach
> > > > > > >> >> > > > > > > this
> > > > > > >> >> > > > > > > > > > > > limitation.
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > BR,
> > > > > > >> >> > > > > > > > > > > > G
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > On Thu, Jan 13, 2022 at 1:00 PM 张俊帆 <
> > > > > > >> >> > > > zuston.sha...@gmail.com
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > > > > wrote:
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > > Hi G
> > > > > > >> >> > > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > > Thanks for your quick reply. I
> think
> > > > > > reserving
> > > > > > >> the
> > > > > > >> >> > > config
> > > > > > >> >> > > > > of
> > > > > > >> >> > > > > > > > > > > > >
> > > > *security.kerberos.fetch.delegation-token*
> > > > > > >> >> > > > > > > > > > > > > and simplifying disable the token
> > > > fetching
> > > > > > is a
> > > > > > >> >> good
> > > > > > >> >> > > > > idea.By
> > > > > > >> >> > > > > > > the
> > > > > > >> >> > > > > > > > > way,
> > > > > > >> >> > > > > > > > > > > > > maybe this should be added
> > > > > > >> >> > > > > > > > > > > > > in the migration plan or
> intergation
> > > > > section
> > > > > > in
> > > > > > >> >> the
> > > > > > >> >> > > > > FLIP-211.
> > > > > > >> >> > > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > > Besides, I have a question that the
> > KDC
> > > > > will
> > > > > > >> >> collapse
> > > > > > >> >> > > > when
> > > > > > >> >> > > > > > the
> > > > > > >> >> > > > > > > > > > cluster
> > > > > > >> >> > > > > > > > > > > > > reached 200 nodes you described
> > > > > > >> >> > > > > > > > > > > > > in the google doc. Do you have any
> > > > > attachment
> > > > > > >> or
> > > > > > >> >> > > > reference
> > > > > > >> >> > > > > to
> > > > > > >> >> > > > > > > > prove
> > > > > > >> >> > > > > > > > > > it?
> > > > > > >> >> > > > > > > > > > > > > Because in our internal
> per-cluster,
> > > > > > >> >> > > > > > > > > > > > > the nodes reaches > 1000 and KDC
> > looks
> > > > > good.
> > > > > > >> Do i
> > > > > > >> >> > > missed
> > > > > > >> >> > > > or
> > > > > > >> >> > > > > > > > > > misunderstood
> > > > > > >> >> > > > > > > > > > > > > something? Please correct me.
> > > > > > >> >> > > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > > Best
> > > > > > >> >> > > > > > > > > > > > > JunFan.
> > > > > > >> >> > > > > > > > > > > > > On Jan 13, 2022, 5:26 PM +0800,
> > > > > > >> >> dev@flink.apache.org
> > > > > > >> >> > ,
> > > > > > >> >> > > > > wrote:
> > > > > > >> >> > > > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > >
> > > > > > >> >> > > > > > > > >
> > > > > > >> >> > > > > > > >
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > >
> > > > > > >> >> > > >
> > > > > > >> >> > >
> > > > > > >> >> >
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1JzMbQ1pCJsLVz8yHrCxroYMRP2GwGwvacLrGyaIx5Yc/edit?fbclid=IwAR0vfeJvAbEUSzHQAAJfnWTaX46L6o7LyXhMfBUCcPrNi-uXNgoOaI8PMDQ
> > > > > > >> >> > > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > > > >
> > > > > > >> >> > > > > > > > > >
> > > > > > >> >> > > > > > > > >
> > > > > > >> >> > > > > > > >
> > > > > > >> >> > > > > > >
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > >
> > > > > > >> >> > > >
> > > > > > >> >> > >
> > > > > > >> >> >
> > > > > > >> >>
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
