Hi Rajini.  The KIP is updated.  Aside from a once-over to make sure it is
all accurate, I think we need to confirm the metrics.  The decision to not
reject authentications that use tokens with too long a lifetime allowed the
metrics to be simpler.  I decided that in addition to tracking these
metrics on the broker:

failed-reauthentication-{rate,total} and
successful-reauthentication-{rate,total}

we simply need one more set of broker metrics to track the subset of
clients that are not upgraded to v2.1.0 and are still using a V0
SaslAuthenticateRequest:

failed-v0-authentication-{rate,total} and
successful-v0-authentication-{rate,total}

See the Migration section of the KIP for details of how this would be used.
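To make the migration use concrete: the idea is that an operator would watch successful-v0-authentication-total on every broker and only turn on server-side connection termination once it reads zero everywhere. A sketch of that check (class and method names are mine, not part of the KIP):

```java
import java.util.Map;

// Hypothetical sketch (not in the KIP): decide whether every broker has
// stopped seeing V0 SaslAuthenticateRequests, i.e. whether it is safe to
// enable server-side expired-connection termination.
public class V0MigrationCheck {
    // totals keyed by broker id, sampled from successful-v0-authentication-total
    public static boolean safeToEnable(Map<Integer, Long> v0AuthTotalByBroker) {
        // safe only when no broker has seen a v0 authentication
        return v0AuthTotalByBroker.values().stream().allMatch(t -> t == 0L);
    }
}
```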

I wonder if we need a broker metric documenting the number of "expired"
sessions killed by the broker since it would be the same as
successful-v0-authentication-total. I've eliminated that from the KIP for
now.  Thoughts?

There is also a client-side metric for re-authentication latency tracking
(still unnamed -- do you have a preference?)

I think we're close to being able to put this KIP up for a vote.

Ron


On Mon, Sep 17, 2018 at 2:45 PM Ron Dagostino <rndg...@gmail.com> wrote:

> <<<extra field alongside errors, not in the opaque body as mentioned in
> the KIP
> Right, that was a concern I had mentioned in a follow-up email.  Agree it
> should go alongside errors.
>
> <<<SCRAM for delegation token based authentication where credentials have
> <<<an expiry time, so we do need to be able to set individual credential
> <<<lifetimes, where the server knows the expiry time but client may not
> Ah, ok, I was not aware that this case existed.  Agreed, for consistency,
> server will always send it back to the client.
>
> <<<I was trying to avoid this altogether [SaslClient negotiated property
> for lifetime]
> Since we now agree that the server will send it back, yes, there is no
> need for this.
>
> <<<wasn't entirely clear about the purpose of
> [sasl.login.refresh.reauthenticate.enable]
> I think this might have been more useful when we weren't necessarily going
> to support all SASL mechanisms and/or when the broker was not going to
> advertise the fact that it supported re-authentication.  You are correct,
> now that we support it for all SASL mechanisms and we are bumping an API
> version, I think it is okay to simply enable it wherever both the client
> and server meet the required versions.
>
> <<<wasn't entirely clear about the purpose of [connections.max.reauth.ms]
> <<<wasn't expecting that we would reject
> <<<tokens or tickets simply because they were too long-lived
> <<<tickets are granted by a 3rd party authority
> <<<Client re-authenticates even though token was not
> <<<refreshed. Does this matter?
> I was going under the assumption that it would matter, but based on your
> pushback I just realized that the same functionality can be implemented as
> part of token validation if there is a desire to limit token lifetimes to a
> certain max value (and the token validator has to be provided in production
> anyway since all we provide out-of-the-box is the unsecured validator).  So
> I'm willing to abandon this check as part of re-authentication.
>
> I'll adjust the KIP accordingly a bit later.  Thanks for the continued
> feedback/discussion.
>
> Ron
>
>
>
>
> On Mon, Sep 17, 2018 at 2:10 PM Rajini Sivaram <rajinisiva...@gmail.com>
> wrote:
>
>> Hi Ron,
>>
>>
>> *1) Is there a reason to communicate this value back to the client when
>> the
>> client already knows it?  It's an extra network round-trip, and since the
>> SASL round-trips are defined by the spec I'm not certain adding an extra
>> round-trip is acceptable.*
>>
>> I wasn't suggesting an extra round-trip for propagating session lifetime.
>> I
>> was expecting session lifetime to be added to the last SASL_AUTHENTICATE
>> response from the broker. Because SASL is a challenge-response mechanism,
>> SaslServer knows when the response being sent is the last one and hence
>> can
>> send the session lifetime in the response (in the same way as we propagate
>> errors). I was expecting this to be added as an extra field alongside
>> errors, not in the opaque body as mentioned in the KIP. The opaque byte
>> array strictly conforms to the SASL mechanism wire protocol and we want to
>> keep it that way.
>>
>> As you have said, we don't need server to propagate session lifetime for
>> OAUTHBEARER since client knows the token lifetime. But server also knows
>> the credential lifetime and by having the server decide the lifetime, we
>> can use the same code path for all mechanisms. If we only have a constant
>> max lifetime in the server, for PLAIN and SCRAM we will end up having the
>> same lifetime for all credentials with no ability to set actual expiry. We
>> use SCRAM for delegation token based authentication where credentials have
>> an expiry time, so we do need to be able to set individual credential
>> lifetimes, where the server knows the expiry time but client may not.
>>
>> *2) I also just realized that if the client is to learn the credential
>> lifetime we wouldn't want to put special-case code in the Authenticator
>> for
>> GSSAPI and OAUTHBEARER; we would want to expose the value generically,
>> probably as a negotiated property on the SaslClient instance.*
>>
>> I was trying to avoid this altogether. Client doesn't need to know
>> credential lifetime. Server asks the client to re-authenticate within its
>> session lifetime.
>>
>> 3) From the KIP, I wasn't entirely clear about the purpose of the two
>> configs:
>>
>> sasl.login.refresh.reauthenticate.enable: Do we need this? Client knows if
>> broker version supports re-authentication based on the SASL_AUTHENTICATE
>> version returned in ApiVersionsResponse. Client knows if broker is
>> configured to enable re-authentication based on session lifetime returned
>> in SaslAuthenticateResponse. If broker has re-authentication configured
>> and
>> client supports re-authentication, you would always want re-authenticate.
>> So I wasn't sure why we need a config to opt in or out on the client-side.
>>
>>
>> connections.max.reauth.ms: We obviously need a broker-side config. Not
>> entirely sure about the semantics of the config that drives
>> re-authentication. In particular, I wasn't expecting that we would reject
>> tokens or tickets simply because they were too long-lived. Since tokens or
>> tickets are granted by a 3rd party authority, I am not sure if clients
>> will
>> always have control over the lifetime. Do we need to support any more
>> scenarios than these:
>>
>>
>> A) reauth.ms=10, credential.lifetime.ms=10 : Broker sets
>> session.lifetime=10,
>> so this works.
>>
>> B) reauth.ms=10, credential.lifetime.ms=5 : Broker sets
>> session.lifetime=5,
>> so this works.
>> C) reauth.ms=10, credential.lifetime.ms=20 : Broker sets
>> session.lifetime=10. Client re-authenticates even though token was not
>> refreshed. Does this matter?
>> D) reauth.ms=Long.MAX_VALUE, credential.lifetime.ms=10: Broker sets
>> session.lifetime=10, client re-authenticates based on credential expiry.
>> E) reauth.ms=0 (default), credential.lifetime.ms=10 : Broker sets
>> session.lifetime=0, Broker doesn't terminate sessions, client doesn't
>> re-authenticate. We generate useful metrics.
>> F) reauth.ms=0 (default), no lifetime for credential (e.g. PLAIN): Broker
>> sets session.lifetime=0, Broker doesn't terminate sessions, client doesn't
>> re-authenticate
>> G) reauth.ms=10, no lifetime for credential (e.g. PLAIN) : Broker sets
>> session.lifetime=10. Client re-authenticates.
>>
>> I would have thought that D) is the typical scenario for OAuth/Kerberos to
>> respect token expiry time. G) would be typical scenario for PLAIN to force
>> re-authentication at regular intervals. A/B/C are useful to force
>> re-authentication in scenarios where you might check for credential
>> revocation in the server. And E/F are useful to disable re-authentication
>> and generate metrics (also the default behaviour useful during migration).
>> Have I missed something?
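For what it's worth, scenarios A-G all follow a single rule: if reauth.ms is 0 the broker advertises 0 (disabled); otherwise it advertises the smaller of reauth.ms and the credential lifetime, if any. A sketch of that rule (names are illustrative, not from the KIP):

```java
// Hypothetical sketch of the session-lifetime rule implied by scenarios
// A-G. credentialLifetimeMs is null for mechanisms like PLAIN that carry
// no expiry; reauthMs is the broker-side config value.
public class SessionLifetime {
    public static long compute(long reauthMs, Long credentialLifetimeMs) {
        if (reauthMs == 0)
            return 0; // E/F: broker neither terminates nor asks for re-auth
        if (credentialLifetimeMs == null)
            return reauthMs; // G: no credential expiry, use the configured max
        // A/B/C/D: never let the session outlive the configured max
        return Math.min(reauthMs, credentialLifetimeMs);
    }
}
```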
>>
>>
>> On Mon, Sep 17, 2018 at 4:27 PM, Ron Dagostino <rndg...@gmail.com> wrote:
>>
>> > Hi yet again, Rajini.  I also just realized that if the client is to
>> learn
>> > the credential lifetime we wouldn't want to put special-case code in the
>> > Authenticator for GSSAPI and OAUTHBEARER; we would want to expose the
>> value
>> > generically, probably as a negotiated property on the SaslClient
>> instance.
>> > We might be talking about making the negotiated property key part of the
>> > public API.  In other words, the SaslClient would be responsible for
>> > exposing the credential (i.e. token or ticket) lifetime at a well-known
>> > negotiated property name, such as "Credential.Lifetime" and putting a
>> Long
>> > value there if there is a lifetime.  That well-known key (e.g.
>> > "Credential.Lifetime") would be part of the public API, right?
>> >
>> > Ron
>> >
>> > On Mon, Sep 17, 2018 at 11:03 AM Ron Dagostino <rndg...@gmail.com>
>> wrote:
>> >
>> > > Hi again, Rajini.  After thinking about this a little while, it
>> occurs to
>> > > me that maybe the communication of max session lifetime should occur
>> via
>> > > SASL_HANDSHAKE after all.  Here's why.  The value communicated is the
>> max
>> > > session lifetime allowed, and the client can assume it is the
>> session
>> > > lifetime for that particular session unless the particular SASL
>> mechanism
>> > > could result in a shorter session that would be obvious to the client
>> and
>> > > the server.  In particular, for OAUTHBEARER, the session lifetime
>> will be
>> > > the token lifetime, which the client and server will both know.  Is
>> > there a
>> > > reason to communicate this value back to the client when the client
>> > already
>> > > knows it?  It's an extra network round-trip, and since the SASL
>> > round-trips
>> > > are defined by the spec I'm not certain adding an extra round-trip is
>> > > acceptable.  Even if we decide we can add it, it helps with latency
>> if we
>> > > don't.
>> > >
>> > > Kerberos may be a bit different -- I don't know if the broker can
>> learn
>> > > the session lifetime.  If it can then the same thing holds -- both
>> client
>> > > and server will know the session lifetime and there is no reason to
>> > > communicate it back.  If the server can't learn the lifetime then I
>> don't
>> > > think adding an extra SASL_AUTHENTICATE round trip is going to help,
>> > anyway.
>> > >
>> > > Also, by communicating the max session lifetime in the SASL_HANDSHAKE
>> > > response, both OAUTHBEARER and GSSAPI clients will be able to know
>> before
>> > > sending any SASL_AUTHENTICATE requests whether their credential
>> violates
>> > > the maximum.  This allows a behaving client to give a good error
>> message.
>> > > A malicious client would ignore the value and send a longer-lived
>> > > credential, and then that would be rejected on the server side.
>> > >
>> > > I'm still good with ExpiringCredential not needing to be public.
>> > >
>> > > What do you think?
>> > >
>> > > Ron
>> > >
>> > >
>> > > On Mon, Sep 17, 2018 at 9:13 AM Ron Dagostino <rndg...@gmail.com>
>> wrote:
>> > >
>> > >> Hi Rajini.  I've updated the KIP to reflect the decision to fully
>> > support
>> > >> this for all SASL mechanisms and to not require the
>> ExpiringCredential
>> > >> interface to be public.
>> > >>
>> > >> Ron
>> > >>
>> > >> On Mon, Sep 17, 2018 at 6:55 AM Ron Dagostino <rndg...@gmail.com>
>> > wrote:
>> > >>
>> > >>> Actually, I think the metric remains the same assuming the
>> > >>> authentication fails when the token  lifetime is too long.  Negative
>> > max
>> > >>> config on server counts what would have been killed because maybe
>> > client
>> > >>> re-auth is not turned on; positive max enables the kill, which is
>> > counted
>> > >>> by a second metric as proposed.  The current proposal had already
>> > stated
>> > >>> that a non-zero value would disallow an authentication with a token
>> > that
>> > >>> has too large a lifetime, and that is still the case, I think.  But
>> > let’s
>> > >>> make sure we are on the same page about all this.
>> > >>>
>> > >>> Ron
>> > >>>
>> > >>> > On Sep 17, 2018, at 6:42 AM, Ron Dagostino <rndg...@gmail.com>
>> > wrote:
>> > >>> >
>> > >>> > Hi Rajini.  I see what you are saying.  The background login
>> refresh
>> > >>> thread does have to factor into the decision for OAUTHBEARER because
>> > there
>> > >>> is no new token to re-authenticate with until that refresh succeeds
>> > >>> (similarly with Kerberos), but I think you are right that the
>> interface
>> > >>> doesn’t necessarily have to be public.  The server does decide the
>> > time for
>> > >>> PLAIN and the SCRAM-related mechanisms under the current proposal,
>> but
>> > it
>> > >>> is done in SaslHandshakeResponse, and at that point the server won’t
>> > have
>> > >>> any token yet; making it happen via SaslAuthenticateRequest at the
>> > very end
>> > >>> does allow the server to know everything for all mechanisms.
>> > >>> >
>> > >>> > I see two potential issues to discuss.  First, the server must
>> > >>> communicate a time that exceeds the (token or ticket) refresh period
>> > on the
>> > >>> client.  Assuming it communicates the expiration time of the
>> > token/ticket,
>> > >>> that’s the best it can do.  So I think that’ll be fine.
>> > >>> >
>> > >>> > The second issue is what happens if the server is configured to
>> > accept
>> > >>> a max value — say, an hour — and the token is good for longer.  I
>> > assumed
>> > >>> that the client should not be allowed to authenticate in the first
>> > place
>> > >>> because it could then simply re-authenticate with the same token
>> after
>> > an
>> > >>> hour, which defeats the motivations for this KIP.  So do we agree
>> the
>> > max
>> > >>> token lifetime allowed will be enforced at authentication time?
>> > Assuming
>> > >>> so, then we need to discuss migration.
>> > >>> >
>> > >>> > The current proposal includes a metric that can be used to
>> identify
>> > if
>> > >>> an OAUTHBEARER client is misconfigured — the number of connections
>> that
>> > >>> would have been killed (could be non-zero when the configured max
>> is a
>> > >>> negative number). Do we still want this type of an indication, and
>> if
>> > so,
>> > >>> is it still done the same way — a negative number for max, but
>> instead
>> > of
>> > >>> counting the number of kills that would have been done it counts the
>> > number
>> > >>> of authentications that would have been failed?
>> > >>> >
>> > >>> > Ron
>> > >>> >
>> > >>> >> On Sep 17, 2018, at 6:06 AM, Rajini Sivaram <
>> > rajinisiva...@gmail.com>
>> > >>> wrote:
>> > >>> >>
>> > >>> >> Hi Ron,
>> > >>> >>
>> > >>> >> Thinking a bit more about other SASL mechanisms, I think the
>> issue
>> > is
>> > >>> that
>> > >>> >> in the current proposal, clients decide the re-authentication
>> > period.
>> > >>> For
>> > >>> >> mechanisms like PLAIN or SCRAM, we would actually want server to
>> > >>> determine
>> > >>> >> the re-authentication interval. For example, the PLAIN or SCRAM
>> > >>> database
>> > >>> >> could specify the expiry time for each principal, or the broker
>> > could
>> > >>> be
>> > >>> >> configured with a single expiry time for all principals of that
>> > >>> mechanism.
>> > >>> >> For OAuth, brokers do know the expiry time. Haven't figured out
>> what
>> > >>> to do
>> > >>> >> with Kerberos, but in any case for broker-side connection
>> > termination
>> > >>> on
>> > >>> >> expiry, we need the broker to know/decide the expiry time. So I
>> > would
>> > >>> like
>> > >>> >> to suggest a slightly different approach to managing
>> > >>> re-authentications.
>> > >>> >>
>> > >>> >> 1) Instead of changing SASL_HANDSHAKE version number, we bump up
>> > >>> >> SASL_AUTHENTICATE version number.
>> > >>> >> 2) In the final SASL_AUTHENTICATE response, broker returns the
>> > expiry
>> > >>> time
>> > >>> >> of this authenticated session. This is the interval after which
>> > >>> broker will
>> > >>> >> terminate the connection If it wasn't re-authenticated.
>> > >>> >> 3) We no longer need the public interface change to add `
>> > >>> >> org.apache.kafka.common.security.expiring.ExpiringCredential` for
>> > the
>> > >>> >> client-side. Instead we schedule the next re-authentication on
>> the
>> > >>> client
>> > >>> >> based on the expiry time provided by the server (some time
>> earlier
>> > >>> than the
>> > >>> >> expiry).
>> > >>> >> 4) If client uses SASL_AUTHENTICATE v0, broker will not return
>> > expiry
>> > >>> time.
>> > >>> >> The connection will be terminated if that feature is enabled (the
>> > >>> same code
>> > >>> >> path as client failing to re-authenticate on time).
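For point (3), the client-side scheduling could look something like this sketch; the 0.80-0.95 window used to re-authenticate "some time earlier than the expiry" is purely an illustrative assumption:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical client-side scheduling for point (3): pick a re-auth time
// somewhere inside a window that ends safely before the broker-advertised
// session expiry, with jitter so clients don't re-authenticate in lockstep.
public class ReauthScheduler {
    public static long nextReauthMs(long nowMs, long sessionLifetimeMs) {
        if (sessionLifetimeMs <= 0)
            return Long.MAX_VALUE; // broker did not ask for re-authentication
        double fraction = ThreadLocalRandom.current().nextDouble(0.80, 0.95);
        return nowMs + (long) (sessionLifetimeMs * fraction);
    }
}
```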
>> > >>> >>
>> > >>> >> Thoughts?
>> > >>> >>
>> > >>> >>> On Sat, Sep 15, 2018 at 3:06 AM, Ron Dagostino <
>> rndg...@gmail.com>
>> > >>> wrote:
>> > >>> >>>
>> > >>> >>> Hi everyone.  I've updated the KIP to reflect all discussion to
>> > date,
>> > >>> >>> including the decision to go with the low-level approach.  This
>> > >>> latest
>> > >>> >>> version of the KIP includes the above "
>> connections.max.reauth.ms"
>> > >>> >>> proposal,
>> > >>> >>> which I know has not been discussed yet.  It mentions new
>> metrics,
>> > >>> some of
>> > >>> >>> which may not have been discussed either (and names are missing
>> > from
>> > >>> some
>> > >>> >>> of them).  Regardless, this new version is the closest yet to a
>> > >>> version
>> > >>> >>> that can be put to a vote next week.
>> > >>> >>>
>> > >>> >>> Ron
>> > >>> >>>
>> > >>> >>>> On Fri, Sep 14, 2018 at 8:59 PM Ron Dagostino <
>> rndg...@gmail.com>
>> > >>> wrote:
>> > >>> >>>>
>> > >>> >>>> Minor correction: I'm proposing "connections.max.reauth.ms" as
>> > the
>> > >>> >>>> broker-side configuration property, not "
>> > >>> >>>> connections.max.expired.credentials.ms".
>> > >>> >>>>
>> > >>> >>>> Ron
>> > >>> >>>>
>> > >>> >>>>> On Fri, Sep 14, 2018 at 8:40 PM Ron Dagostino <
>> rndg...@gmail.com
>> > >
>> > >>> wrote:
>> > >>> >>>>>
>> > >>> >>>>> Hi Rajini.  I'm going to choose *
>> > >>> connections.max.expired.credentials.ms
>> > >>> >>>>> <http://connections.max.expired.credentials.ms>* as the
>> option
>> > >>> name
>> > >>> >>>>> because it is consistent with the comment I made in the
>> section
>> > >>> about
>> > >>> >>> how
>> > >>> >>>>> to get the client and server to agree on credential lifetime:
>> > >>> >>>>>
>> > >>> >>>>> "We could add a new API call so that clients could ask servers
>> > for
>> > >>> the
>> > >>> >>>>> lifetime they use, or we could extend the
>> > >>> SaslHandshakeRequest/Response
>> > >>> >>> API
>> > >>> >>>>> call to include that information in the server's response –
>> the
>> > >>> client
>> > >>> >>>>> would then adopt that value"
>> > >>> >>>>>
>> > >>> >>>>>
>> > >>> >>>>> We set the config option value on the broker (with the "
>> > >>> >>>>> listener.name.mechanism." prefix), and we will return the
>> > >>> configured
>> > >>> >>>>> value in the SaslHandshakeResponse (requiring a wire format
>> > change
>> > >>> in
>> > >>> >>>>> addition to a version bump).  The value (referred to as "X"
>> > below)
>> > >>> can
>> > >>> >>> be
>> > >>> >>>>> negative, zero, or positive and is to be interpreted as
>> follows:
>> > >>> >>>>>
>> > >>> >>>>> X = 0: Fully disable.  The server will respond to
>> > >>> re-authentications,
>> > >>> >>> but
>> > >>> >>>>> it won't kill any connections due to expiration, and it won't
>> > track
>> > >>> >>> either
>> > >>> >>>>> of the 2 metrics mentioned below.
>> > >>> >>>>>
>> > >>> >>>>> Now, a couple of definitions for when X != 0:
>> > >>> >>>>>
>> > >>> >>>>> 1) We define the *maximum allowed expiration time* to be the
>> time
>> > >>> >>>>> determined by the point when (re-)authentication occurs plus
>> |X|
>> > >>> >>>>> milliseconds.
>> > >>> >>>>> 2) We define the *requested expiration time* to be the maximum
>> > >>> allowed
>> > >>> >>>>> expiration time except for the case of an OAuth bearer token,
>> in
>> > >>> which
>> > >>> >>> case
>> > >>> >>>>> it is the point at which the token expires.  (Kerberos
>> presumably
>> > >>> also
>> > >>> >>>>> specifies a ticket lifetime, but frankly I am not in a
>> position
>> > to
>> > >>> do
>> > >>> >>> any
>> > >>> >>>>> Kerberos-related coding in time for a 2.1.0 release and would
>> > >>> prefer to
>> > >>> >>>>> ignore this piece of information for this KIP -- would it be
>> > >>> acceptable
>> > >>> >>> to
>> > >>> >>>>> have someone else add it later?).
>> > >>> >>>>>
>> > >>> >>>>> Based on these definitions, we define the behavior as follows:
>> > >>> >>>>>
>> > >>> >>>>> *X > 0: Fully enable*
>> > >>> >>>>> A) The server will reject any authentication/re-authentication
>> > >>> attempt
>> > >>> >>>>> when the requested expiration time is after the maximum
>> allowed
>> > >>> >>> expiration
>> > >>> >>>>> time (which could only happen with an OAuth bearer token,
>> > assuming
>> > >>> we
>> > >>> >>> skip
>> > >>> >>>>> dealing with Kerberos for now).
>> > >>> >>>>> B) The server will kill connections that are used beyond the
>> > >>> requested
>> > >>> >>>>> expiration time.
>> > >>> >>>>> C) A broker metric will be maintained that documents the
>> number
>> > >>> >>>>> connections killed by the broker.  This metric will be
>> non-zero
>> > if
>> > >>> a
>> > >>> >>> client
>> > >>> >>>>> is connecting to the broker with re-authentication either
>> > >>> unavailable
>> > >>> >>> (i.e.
>> > >>> >>>>> an older client) or disabled.
>> > >>> >>>>>
>> > >>> >>>>> *X < 0: Partially enable*
>> > >>> >>>>> A) Same as above: the server will reject any
>> > >>> >>>>> authentication/re-authentication attempt when the requested
>> > >>> expiration
>> > >>> >>> time
>> > >>> >>>>> is after the maximum allowed expiration time (which could only
>> > >>> happen
>> > >>> >>> with
>> > >>> >>>>> an OAuth bearer token, assuming we skip dealing with Kerberos
>> for
>> > >>> now).
>> > >>> >>>>> B) The server will **not** kill connections that are used
>> beyond
>> > >>> the
>> > >>> >>>>> requested expiration time.
>> > >>> >>>>> C) A broker metric will be maintained that documents the
>> number
>> > of
>> > >>> API
>> > >>> >>>>> requests unrelated to re-authentication that are made over a
>> > >>> connection
>> > >>> >>>>> beyond the requested expiration time.  This metric will be
>> > helpful
>> > >>> for
>> > >>> >>>>> operators to ensure that all clients are properly upgraded and
>> > >>> >>>>> re-authenticating before fully enabling the server-side
>> > >>> >>>>> expired-connection-kill functionality (i.e. by changing the
>> value
>> > >>> from a
>> > >>> >>>>> negative number to a positive number): this metric will be
>> zero
>> > >>> across
>> > >>> >>> all
>> > >>> >>>>> brokers when it is safe to fully enable the server-side
>> feature.
>> > >>> >>>>>
>> > >>> >>>>> Thoughts?
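The sign conventions above could be sketched as follows (names hypothetical; X is the configured broker-side value, and this is illustrative, not an implementation):

```java
// Hypothetical sketch of the proposed X semantics. requestedExpiryMs is
// the token expiry for OAuth bearer tokens, otherwise equal to the
// maximum allowed expiration time.
public class ReauthPolicy {
    public enum Decision { ACCEPT, REJECT }

    public static Decision authenticate(long x, long nowMs, long requestedExpiryMs) {
        if (x == 0)
            return Decision.ACCEPT; // fully disabled: no checks, no metrics
        long maxAllowedMs = nowMs + Math.abs(x);
        // X != 0: reject credentials that outlive the maximum allowed expiration
        return requestedExpiryMs > maxAllowedMs ? Decision.REJECT : Decision.ACCEPT;
    }

    public static boolean killExpiredConnections(long x) {
        return x > 0; // X < 0 only counts via a metric; X > 0 also kills
    }
}
```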
>> > >>> >>>>>
>> > >>> >>>>> Ron
>> > >>> >>>>>
>> > >>> >>>>> On Fri, Sep 14, 2018 at 11:59 AM Rajini Sivaram <
>> > >>> >>> rajinisiva...@gmail.com>
>> > >>> >>>>> wrote:
>> > >>> >>>>>
>> > >>> >>>>>> It will be good to shorten the config name. We have a config
>> > >>> named `
>> > >>> >>>>>> connections.max.idle.ms`. We could add something like `
>> > >>> >>>>>> connections.max.expired.credentials.ms`. As you said, it
>> would
>> > be
>> > >>> >>>>>> prefixed
>> > >>> >>>>>> with listener and SASL mechanism name. We should be able to
>> > >>> support
>> > >>> >>>>>> connection termination in future even with SSL, so perhaps we
>> > >>> don't
>> > >>> >>> want
>> > >>> >>>>>> the `sasl.server.` prefix (we just need to validate whether
>> the
>> > >>> config
>> > >>> >>> is
>> > >>> >>>>>> supported for the protocol). You can choose whether a boolean
>> > >>> flag or a
>> > >>> >>>>>> time interval makes more sense.
>> > >>> >>>>>>
>> > >>> >>>>>> For the client-side, the KIP explains how to make it work for
>> > >>> other
>> > >>> >>>>>> mechanisms and we can leave it that. Not convinced about
>> > >>> server-side
>> > >>> >>>>>> though. It seems to me that we probably would require API
>> > changes
>> > >>> to
>> > >>> >>> make
>> > >>> >>>>>> it work. Basically the proposed approach works only for
>> > >>> OAUTHBEARER.
>> > >>> >>>>>> Since
>> > >>> >>>>>> we have to bump up SaslHandshakeRequest version in this KIP,
>> it
>> > >>> will be
>> > >>> >>>>>> good to work out if we need to change the request or flow to
>> > make
>> > >>> this
>> > >>> >>>>>> work
>> > >>> >>>>>> for other mechanisms. I haven't figured out exactly what is
>> > >>> needed, but
>> > >>> >>>>>> will think about it and get back next week. In the meantime,
>> you
>> > >>> can
>> > >>> >>> get
>> > >>> >>>>>> the KIP up-to-date with the config, migration path etc. and
>> get
>> > it
>> > >>> >>> ready
>> > >>> >>>>>> to
>> > >>> >>>>>> start vote next week.
>> > >>> >>>>>>
>> > >>> >>>>>>
>> > >>> >>>>>>
>> > >>> >>>>>>
>> > >>> >>>>>>
>> > >>> >>>>>> On Fri, Sep 14, 2018 at 1:24 PM, Ron Dagostino <
>> > rndg...@gmail.com
>> > >>> >
>> > >>> >>>>>> wrote:
>> > >>> >>>>>>
>> > >>> >>>>>>> Hi Rajini.
>> > >>> >>>>>>>
>> > >>> >>>>>>> Agreed, we will bump the SaslHandshakeRequest API number (no
>> > wire
>> > >>> >>>>>> format
>> > >>> >>>>>>> change, of course).
>> > >>> >>>>>>>
>> > >>> >>>>>>> Agreed, responding to re-authentication will always be
>> enabled
>> > >>> on the
>> > >>> >>>>>>> broker since it is initiated by the client.
>> > >>> >>>>>>>
>> > >>> >>>>>>> Agreed, connection termination by the broker will be opt-in.
>> > I'm
>> > >>> >>>>>> thinking
>> > >>> >>>>>>> *sasl.server.disconnect.expired.credential.connections.
>> > enable*,
>> > >>> >>>>>> prefixed
>> > >>> >>>>>>> with listener and SASL mechanism name in lower-case so it
>> can
>> > be
>> > >>> >>>>>> opt-in at
>> > >>> >>>>>>> the granularity of a SASL mechanism; for example, "
>> > >>> >>>>>>> listener.name.sasl_ssl.oauthbearer.sasl.server.
>> > >>> >>>>>>> disconnect.expired.credential.connections.enable=true
>> > >>> >>>>>>> ".  It is long, so if you have a preference for a shorter
>> name
>> > >>> let me
>> > >>> >>>>>> know
>> > >>> >>>>>>> and we'll go with it instead.
>> > >>> >>>>>>>
>> > >>> >>>>>>> Agreed, additional metrics will be helpful.  I'll add them
>> to
>> > the
>> > >>> >>> KIP.
>> > >>> >>>>>>>
>> > >>> >>>>>>> <<<need to work out exactly how this works with all SASL
>> > >>> mechanisms
>> > >>> >>>>>>> <<<not sure we have the data required on the broker or a
>> way of
>> > >>> >>>>>>> <<< extending the [GSSAPI] mechanism
>> > >>> >>>>>>> Not sure I agree here.  The paragraph I added to the KIP
>> > >>> describes
>> > >>> >>> how
>> > >>> >>>>>> I
>> > >>> >>>>>>> think this can be done.  Given that description, do you
>> still
>> > >>> feel
>> > >>> >>>>>> Kerberos
>> > >>> >>>>>>> will not be possible?  If Kerberos presents a significant
>> > hurdle
>> > >>> >>> then I
>> > >>> >>>>>>> don't think that should prevent us from doing it for other
>> > >>> mechanisms
>> > >>> >>>>>> -- I
>> > >>> >>>>>>> would rather state that it isn't supported with GSSAPI due
>> to
>> > >>> >>>>>> limitations
>> > >>> >>>>>>> in Kerberos itself than not have it for OAUTHBEARER, for
>> > example.
>> > >>> >>> And
>> > >>> >>>>>>> right now I don't think we need more than a high-level
>> > >>> description of
>> > >>> >>>>>> how
>> > >>> >>>>>>> this could be done.  In other words, I think we have this
>> > >>> covered,
>> > >>> >>>>>> unless
>> > >>> >>>>>>> there is a fundamental problem due to Kerberos not making
>> data
>> > >>> (i.e.
>> > >>> >>>>>> ticket
>> > >>> >>>>>>> expiration) available on the server.   I want to submit this
>> > KIP
>> > >>> for
>> > >>> >>> a
>> > >>> >>>>>> vote
>> > >>> >>>>>>> early next week in the hope of having it accepted by the
>> > Monday,
>> > >>> 9/24
>> > >>> >>>>>>> deadline for the 2.1.0 release, and if that happens I
>> believe I
>> > >>> can
>> > >>> >>>>>> get a
>> > >>> >>>>>>> PR into really good shape soon thereafter (I'm working on it
>> > >>> now).
>> > >>> >>>>>>>
>> > >>> >>>>>>> Ron
>> > >>> >>>>>>>
>> > >>> >>>>>>>
>> > >>> >>>>>>>
>> > >>> >>>>>>>
>> > >>> >>>>>>>
>> > >>> >>>>>>>
>> > >>> >>>>>>> On Fri, Sep 14, 2018 at 5:57 AM Rajini Sivaram <
>> > >>> >>>>>> rajinisiva...@gmail.com>
>> > >>> >>>>>>> wrote:
>> > >>> >>>>>>>
>> > >>> >>>>>>>> Hi Ron,
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> 1) Yes, we should update the version of
>> > `SaslHandshakeRequest`.
>> > >>> >>>>>> Clients
>> > >>> >>>>>>>> would attempt to re-authenticate only if broker's
>> > >>> >>>>>> SaslHandshakeRequest
>> > >>> >>>>>>>> version >=2.
>> > >>> >>>>>>>> 2) I think we should enable broker's re-authentication and
>> > >>> >>> connection
>> > >>> >>>>>>>> termination code regardless of client version.
>> > Re-authentication
>> > >>> >>>>>> could be
>> > >>> >>>>>>>> always enabled since it is triggered only by clients and
>> > broker
>> > >>> >>> could
>> > >>> >>>>>>> just
>> > >>> >>>>>>>> handle it. Connection termination should be opt-in, but
>> should
>> > >>> work
>> > >>> >>>>>> with
>> > >>> >>>>>>>> clients of any version. If turned on and clients are of an
>> > older
>> > >>> >>>>>> version,
>> > >>> >>>>>>>> this would simply force reconnection, which should be ok.
>> > >>> >>> Additional
>> > >>> >>>>>>>> metrics to monitor expired connections may be useful
>> anyway.
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> We also need to work out exactly how this works with all
>> SASL
>> > >>> >>>>>> mechanisms.
>> > >>> >>>>>>>> In particular, we have ensure that we can make it work with
>> > >>> >>> Kerberos
>> > >>> >>>>>>> since
>> > >>> >>>>>>>> we use the implementation from the JRE and I am not sure we
>> > have
>> > >>> >>> the
>> > >>> >>>>>> data
>> > >>> >>>>>>>> required on the broker or a way of extending the mechanism.
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> Thoughts?
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> On Thu, Sep 13, 2018 at 6:56 PM, Ron Dagostino <
>> > >>> rndg...@gmail.com>
>> > >>> >>>>>>> wrote:
>> > >>> >>>>>>>>
Hi Rajini.  I'm thinking about how we deal with migration.  For example,
let's say we have an existing 2.0.0 cluster using SASL/OAUTHBEARER and we
want to use this feature.  The desired end state is to have all clients and
brokers migrated to a version that supports the feature (call it 2.x) with
the feature turned on.  We need to document the supported path(s) to this
end state.

Here are a couple of scenarios with implications:

1) *Migrate client to 2.x and turn the feature on but broker is still at
2.0.*  In this case the client is going to get an error when it sends the
SaslHandshakeRequest.  We want to avoid this.  It seems to me the client
needs to know if the broker has been upgraded to the necessary version
before trying to re-authenticate, which makes me believe we need to bump
the API version number in 2.x and the client has to check to see if the max
version the broker supports meets a minimum standard before trying to
re-authenticate.  Do you agree?

2) *Migrate broker to 2.x and turn the feature on but client is still at
2.0.*  In this case the broker is going to end up killing connections
periodically.  Again, we want to avoid this.  It's tempting to say "don't
do this" but I wonder if we can require installations to upgrade all
clients before turning the feature on at the brokers.  Certainly we can say
"don't do this" for inter-broker clients -- in other words, we can say that
all brokers in a cluster should be upgraded before the feature is turned on
for any one of them -- but I don't know about our ability to say it for
non-broker clients.

Now we consider the cases where both the brokers and the clients have been
upgraded to 2.x.  When and where should the feature be turned on?  The
inter-broker case is interesting because I don't think we can require
installations to restart every broker with a new config where the feature
is turned on at the same time.  Do you agree that we cannot require this
and therefore must support installations restarting brokers with a new
config at whatever pace they need -- which may be quite slow relative to
token lifetimes?  Assuming you do agree, then there is going to be the case
where some brokers are going to have the feature turned on and some clients
(definitely inter-broker clients at a minimum) are going to have the
feature turned off.

The above discussion assumes a single config on the broker side that turns
on both the inter-broker client re-authentication feature as well as the
server-side expired-connection-kill feature, but now I'm thinking we need
the ability to turn these features on independently, plus maybe we need a
way to monitor the number of "active, expired" connections to the server so
that operators can be sure that all clients have been upgraded/enabled
before turning on the server-side expired-connection-kill feature.

So the migration plan would be as follows:

1) Upgrade all brokers to 2.x.
2) After all brokers are upgraded, turn on re-authentication for all
brokers at whatever rate is desired -- just eventually, at some point, get
the client-side feature turned on for all brokers so that inter-broker
connections are re-authenticating.
3) In parallel with (1) and (2) above, upgrade clients to 2.x and turn
their re-authentication feature on.  Clients will check the API version and
only re-authenticate to a broker that has also been upgraded (note that the
ability of a broker to respond to a re-authentication cannot be turned off
-- it is always on beginning with version 2.x, and it just sits there doing
nothing if it isn't exercised by an enabled client).
4) After (1), (2), and (3) are complete, check the 2.x broker metrics to
confirm that there are no "active, expired" connections.  Once you are
satisfied that (1), (2), and (3) are indeed complete you can enable the
server-side expired-connection-kill feature on each broker via a restart at
whatever pace is desired.
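As a concrete sketch of the version gate in step (3): the client would look at the maximum SaslAuthenticateRequest version the broker advertises and only attempt re-authentication when it meets the minimum.  The class, method names, and threshold version below are hypothetical illustrations, not anything defined by the KIP.

```java
// Hypothetical sketch of the client-side gate from step (3): before
// re-authenticating, compare the broker's advertised max
// SaslAuthenticateRequest version against an assumed minimum.  The
// threshold value (1) is illustrative only.
public class ReauthVersionGate {
    // Assumed minimum version whose response carries what the client
    // needs for re-authentication (a v0-only broker does not qualify).
    static final short MIN_REAUTH_VERSION = 1;

    static boolean brokerSupportsReauth(short brokerMaxSaslAuthenticateVersion) {
        return brokerMaxSaslAuthenticateVersion >= MIN_REAUTH_VERSION;
    }

    public static void main(String[] args) {
        // A 2.0 broker advertising only v0: the client must not re-authenticate.
        System.out.println(brokerSupportsReauth((short) 0)); // prints "false"
        // An upgraded 2.x broker advertising v1: safe to re-authenticate.
        System.out.println(brokerSupportsReauth((short) 1)); // prints "true"
    }
}
```

The point of the gate is only that a 2.x client can coexist with un-upgraded brokers by silently skipping re-authentication against them.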

Comments?

Ron

On Wed, Sep 12, 2018 at 4:48 PM Ron Dagostino <rndg...@gmail.com> wrote:

Ok, I am tempted to just say we go with the low-level approach since it is
the quickest and seems to meet the clear requirements.  We can always
adjust later if we get clarity on other requirements or we decide we need
to approach it differently for whatever reason.  But in the meantime,
before fully committing to this decision, I would appreciate another
perspective if someone has one.

Ron

On Wed, Sep 12, 2018 at 3:15 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

Yes, I would leave out retries from this KIP for now.  In the future, if
there is a requirement for supporting retries, we can consider it.  I think
we can support retries with either approach if we needed to, but it would
be better to do it along with other changes required to support
authentication servers that are not highly available.

For maintainability, I am biased, so it will be good to get the perspective
of others in the community :-)

On Wed, Sep 12, 2018 at 7:47 PM, Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini.  Here is some feedback/some things I thought of.

First, I just realized that from a timing perspective I am not sure retry
is going to be necessary.  The background login refresh thread triggers
re-authentication when it refreshes the client's credential.  The OAuth
infrastructure has to be available in order for this refresh to succeed
(the background thread repeatedly retries if it can't refresh the
credential, and that retry loop handles any temporary outage).  If clients
are told to re-authenticate when the credential is refreshed **and they
actually re-authenticate at that point** then it is highly unlikely that
the OAuth infrastructure would fail within those intervening milliseconds.
So we don't need re-authentication retry in this KIP as long as retry
starts immediately.  The low-level prototype as currently coded doesn't
actually start re-authentication until the connection is subsequently used,
and that could take a while.  But then again, if the connection isn't used
for some period of time, then the lost value of having to abandon the
connection is lessened anyway.  Plus, as you pointed out, OAuth
infrastructure is assumed to be highly available.

Does this make sense, and would you be willing to say that retry isn't a
necessary requirement?  I'm tempted but would like to hear your perspective
first.
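To make the timing argument above concrete, here is a minimal model (all names invented for illustration -- this is not the Kafka login-refresh implementation): the refresh loop absorbs token-endpoint outages by retrying, and re-authentication is triggered immediately after a successful refresh, leaving essentially no window in which the OAuth infrastructure could newly fail.

```java
// Illustrative model of "refresh with retry, then re-authenticate
// immediately".  Hypothetical names; sketch only.
import java.util.function.Supplier;

public class RefreshThenReauth {

    /** Retry until the token endpoint succeeds; the loop absorbs transient outages. */
    static String refreshWithRetry(Supplier<String> tokenEndpoint, int maxAttempts) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return tokenEndpoint.get();
            } catch (RuntimeException e) {
                last = e;  // transient failure: try again
            }
        }
        throw last;
    }

    /** Refresh, then immediately tell connections to re-authenticate. */
    static String refreshAndReauthenticate(Supplier<String> tokenEndpoint,
                                           Runnable reauthenticate) {
        String token = refreshWithRetry(tokenEndpoint, 3);
        reauthenticate.run();  // fires milliseconds after the endpoint responded
        return token;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // An endpoint that fails once then succeeds -- the retry loop hides it.
        Supplier<String> flaky = () -> {
            if (calls[0]++ == 0) throw new RuntimeException("transient outage");
            return "token-2";
        };
        String token = refreshAndReauthenticate(
                flaky, () -> System.out.println("re-authenticating"));
        System.out.println(token); // prints "token-2"
    }
}
```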

Interleaving/latency and maintainability (more than lines of code) are the
two remaining areas of comparison.  I did not add the maintainability issue
to the KIP yet, but before I do I thought I would address it here first to
see if we can come to consensus on it, because I'm not sure I see the
high-level approach as being hard to maintain (yet -- I could be
convinced/convince myself; we'll see).  I do want to make sure we are on
the same page about what is required to add re-authentication support in
the high-level case.  Granted, the amount is zero in the low-level case,
and it doesn't get any better than that, but the amount in the high-level
case is very low -- just a few lines of code.  For example:

KafkaAdminClient:
https://github.com/apache/kafka/pull/5582/commits/4fa70f38b9d33428ff98b64a3a2bd668f5f28c38#diff-6869b8fccf6b098cbcb0676e8ceb26a7
It is the same few lines of code for KafkaConsumer, KafkaProducer,
WorkerGroupMember, and TransactionMarkerChannelManager.

The two synchronous I/O use cases are ControllerChannelManager and
ReplicaFetcherBlockingSend (via ReplicaFetcherThread), and they require a
little bit more -- but not much.

Thoughts?

Ron

On Wed, Sep 12, 2018 at 1:57 PM Ron Dagostino <rndg...@gmail.com> wrote:

Thanks, Rajini.  Before I digest/respond to that, here's an update that I
just completed.

I added a commit to the PR (https://github.com/apache/kafka/pull/5582/)
that implements server-side kill of expired OAUTHBEARER connections.  No
tests yet since we still haven't settled on a final approach (low-level vs.
high-level).

I also updated the KIP to reflect the latest discussion and PR as follows:

 - Include support for brokers killing connections as part of this KIP
 (rather than deferring it to a future KIP as was originally mentioned; the
 PR now includes it as mentioned -- it was very easy to add)
 - Added metrics (they will mirror existing ones related to
 authentications; I have not added those to the PR)
 - Updated the implementation description to reflect the current state of
 the PR, which is a high-level, one-size-fits-all approach (as opposed to
 my initial, even-higher-level approach)
 - Added a "Rejected Alternative" for the first version of the PR, which
 injected requests directly into synchronous I/O clients' queues
 - Added a "Rejected Alternative" for the low-level approach as suggested,
 but of course we have not formally decided to reject this approach or
 adopt the current PR implementation.

I'll think about where we stand some more before responding again.  Thanks
for the above reply.

Ron

On Wed, Sep 12, 2018 at 1:36 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

Thank you for summarising, I think it covers the differences between the
two approaches well.

A few minor points to answer the questions in there:

1) When re-authentication is initiated in the Selector during poll(), we
can move an idle channel to re-authentication state.  It is similar to
injecting requests, but achieved by changing the channel back to
authenticating state.

3) To clarify why I think re-authentication should fit in with our
authentication design: My point was not about a specific connection being
usable or not usable.  It was about what happens at the client API level.
Our client API (producer/consumer/admin client etc.) currently assumes that
a single broker authentication failure is a fatal error that is never
retried, because we assume that a broker only ever fails an authentication
request if credentials are invalid.  If we ever decide to support cases
where a broker occasionally fails an authentication request due to a
transient failure, we need to do more around how we handle authentication
failures in clients.  We may decide that it is ok to close the connection
for authentication and not for re-authentication, as you mentioned, but we
need to change the way this disconnection is handled by clients.  So IMO,
we should either add support for transient retriable authentication
failures properly or not retry for any scenario.  Personally, I don't think
we would want to retry all authentication failures even if it is a
re-authentication; I think we could (at some point in future) allow brokers
to return an error code that indicates that it is a transient broker-side
failure rather than invalid credentials, and handle the error differently.
I see no reason at that point why we wouldn't handle authentication and
re-authentication in the same way.

4) As you said, the high-level approach would be bigger than the low-level
approach in terms of LOC, but I wouldn't be too concerned about lines of
code.  My bigger concern is about modularity.  Our security code is already
complex; protocols like Kerberos and SSL that we use from the JRE make
problem diagnosis hard, and async I/O makes the networking code complex.
You need to understand the networking layer to work with the security
layer, but the rest of the code base doesn't rely on knowledge of the
network/security layers.  My main concern about the high-level approach is
that it spans these boundaries, making it harder to maintain in the long
run.
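The future direction sketched in (3) amounts to a broker-supplied error code that separates invalid credentials (fatal, never retried) from a transient broker-side failure (retriable), with one rule applied to both authentication and re-authentication.  A toy illustration -- these enum and method names are invented, and no such error code exists in the protocol today:

```java
// Hypothetical sketch only: distinguishing fatal from transient
// authentication failures via an error code, so that authentication
// and re-authentication can share one retry rule.
public class AuthFailureHandling {
    enum AuthError { NONE, INVALID_CREDENTIALS, TRANSIENT_BROKER_FAILURE }

    /** Whether the client may retry -- the same rule for auth and re-auth. */
    static boolean isRetriable(AuthError error) {
        return error == AuthError.TRANSIENT_BROKER_FAILURE;
    }

    public static void main(String[] args) {
        // Bad credentials stay fatal, exactly as today.
        System.out.println(isRetriable(AuthError.INVALID_CREDENTIALS));      // prints "false"
        // A declared transient broker failure could be retried.
        System.out.println(isRetriable(AuthError.TRANSIENT_BROKER_FAILURE)); // prints "true"
    }
}
```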

On Wed, Sep 12, 2018 at 10:23 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:

Hi Ron, Rajini

Thanks for summarizing the discussion so far, Ron!

1) How often do we have such long-lived connection idleness (e.g. 5-10
minutes) in practice?

3) I agree that retries for re-authentication are useful.

4) The interleaving of requests sounds like a great feature to have, but
the tradeoff against code complexity is a tough one.  I would personally go
with the simpler approach, since you could always add interleaving on top
if the community decides the latency should be better.

Best,
Stanislav

On Tue, Sep 11, 2018 at 5:00 AM Ron Dagostino <rndg...@gmail.com> wrote:

Hi everyone.  I've updated the PR to reflect the latest conclusions from
this ongoing discussion.  The KIP still needs the suggested updates; I will
do that later this week.  I agree with Rajini that some additional feedback
from the community at large would be very helpful at this point in time.

Here's where we stand.

We have 2 separate implementations for re-authenticating existing
connections -- a so-called "low-level" approach and a (relatively speaking)
"high-level" approach -- and we agree on how they should be compared.
Specifically, Rajini has provided a mechanism that works at a relatively
low level in the stack by intercepting network sends and queueing them up
while re-authentication happens; then the queued sends are sent after
re-authentication succeeds, or the connection is closed if
re-authentication fails.  See the prototype commit at
https://github.com/rajinisivaram/kafka/commit/b9d711907ad843c11d17e80d6743bfb1d4e3f3fd.

I have an implementation that works higher up in the stack; it injects
authentication requests into the KafkaClient via a new method added to that
interface, and the implementation (i.e. NetworkClient) sends them when
poll() is called.  The updated PR is available at
https://github.com/apache/kafka/pull/5582/.

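The low-level intercept-and-queue behavior described above can be modeled in a few lines (this is a rough illustration, not the prototype's actual code): while a channel is re-authenticating, application sends are buffered; on success they are flushed in order, and on failure the connection and its buffered sends are discarded.

```java
// Rough model of the low-level approach: buffer sends during
// re-authentication, flush on success, drop on failure.  All names
// are invented for illustration.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class ReauthSendQueue {
    private boolean reauthenticating = false;
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> wire = new ArrayList<>(); // stands in for the socket

    void beginReauthentication() { reauthenticating = true; }

    void send(String payload) {
        if (reauthenticating) {
            pending.add(payload);  // intercepted and queued
        } else {
            wire.add(payload);     // normal path
        }
    }

    /** On success, flush queued sends in order; on failure, drop everything. */
    void finishReauthentication(boolean success) {
        reauthenticating = false;
        if (success) {
            while (!pending.isEmpty()) wire.add(pending.poll());
        } else {
            pending.clear();       // the connection would be closed here
        }
    }

    List<String> sentOnWire() { return wire; }

    public static void main(String[] args) {
        ReauthSendQueue ch = new ReauthSendQueue();
        ch.send("a");
        ch.beginReauthentication();
        ch.send("b");                      // queued, not yet sent
        ch.finishReauthentication(true);
        System.out.println(ch.sentOnWire()); // prints "[a, b]"
    }
}
```

The tradeoff the thread is weighing is visible here: ordering is preserved and callers need no changes, but any payload sent mid-re-authentication waits (or is lost on failure).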
Here is how we think these two approaches should be compared.

1) *When re-authentication starts*.  The low-level approach initiates
re-authentication only if/when the connection is actually used, so it may
start after the existing credential expires; the current PR implementation
injects re-authentication requests into the existing flow, and
re-authentication starts immediately regardless of whether or not the
connection is being used for something else.  Rajini believes the low-level
approach can be adjusted to re-authenticate idle connections (Rajini, I
don't see how this can be done without injecting requests, which is what
the high-level connection is doing, but I am probably missing something --
no need to go into it unless it is easy to explain.)

2) *When requests not related to re-authentication are able to use the
re-authenticating connection*.  The low-level approach finishes
re-authentication completely before allowing anything else to traverse
