<<<extra field alongside errors, not in the opaque body as mentioned in the KIP Right, that was a concern I had mentioned in a follow-up email. Agree it should go alongside errors.
<<<SCRAM for delegation token based authentication where credentials have <<<an expiry time, so we do need to be able to set individual credential <<<lifetimes, where the server knows the expiry time but client may not Ah, ok, I was not aware that this case existed. Agreed, for consistency, server will always send it back to the client. <<<I was trying to avoid this altogether [SaslClient negotiated property for lifetime] Since we now agree that the server will send it back, yes, there is no need for this. <<<wasn't entirely clear about the purpose of [sasl.login.refresh.reauthenticate.enable] I think this might have been more useful when we weren't necessarily going to support all SASL mechanisms and/or when the broker was not going to advertise the fact that it supported re-authentication. You are correct, now that we support it for all SASL mechanisms and we are bumping an API version, I think it is okay to simply enable it wherever both the client and server meet the required versions. <<<wasn't entirely clear about the purpose of [connections.max.reauth.ms] <<<wasn't expecting that we would reject <<<tokens or tickets simply because they were too long-lived <<<tickets are granted by a 3rd party authority <<<Client re-authenticates even though token was not <<<refreshed. Does this matter? I was going under the assumption that it would matter, but based on your pushback I just realized that the same functionality can be implemented as part of token validation if there is a desire to limit token lifetimes to a certain max value (and the token validator has to be provided in production anyway since all we provide out-of-the-box is the unsecured validator). So I'm willing to abandon this check as part of re-authentication. I'll adjust the KIP accordingly a bit later. Thanks for the continued feedback/discussion. Ron On Mon, Sep 17, 2018 at 2:10 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote: > Hi Ron, > > > *1) Is there a reason to communicate this value back to the client when the > client already knows it? It's an extra network round-trip, and since the > SASL round-tripsare defined by the spec I'm not certain adding an extra > round-trip is acceptable.* > > I wasn't suggesting an extra round-trip for propagating session lifetime. I > was expecting session lifetime to be added to the last SASL_AUTHENTICATE > response from the broker. Because SASL is a challenge-response mechanism, > SaslServer knows when the response being sent is the last one and hence can > send the session lifetime in the response (in the same way as we propagate > errors). I was expecting this to be added as an extra field alongside > errors, not in the opaque body as mentioned in the KIP. The opaque byte > array strictly conforms to the SASL mechanism wire protocol and we want to > keep it that way. > > As you have said, we don't need server to propagate session lifetime for > OAUTHBEARER since client knows the token lifetime. But server also knows > the credential lifetime and by having the server decide the lifetime, we > can use the same code path for all mechanisms. If we only have a constant > max lifetime in the server, for PLAIN and SCRAM we will end up having the > same lifetime for all credentials with no ability to set actual expiry. We > use SCRAM for delegation token based authentication where credentials have > an expiry time, so we do need to be able to set individual credential > lifetimes, where the server knows the expiry time but client may not. > > *2) I also just realized that if the client is to learn the credential > lifetime we wouldn't want to put special-case code in the Authenticator for > GSSAPI and OAUTHBEARER; we would want to expose the value generically, > probably as a negotiated property on the SaslClient instance.* > > I was trying to avoid this altogether. Client doesn't need to know > credential lifetime. Server asks the client to re-authenticate within its > session lifetime. > > 3) From the KIP, I wasn't entirely clear about the purpose of the two > configs: > > sasl.login.refresh.reauthenticate.enable: Do we need this? Client knows if > broker version supports re-authentication based on the SASL_AUTHENTICATE > version returned in ApiVersionsResponse. Client knows if broker is > configured to enable re-authentication based on session lifetime returned > in SaslAuthenticateResponse. If broker has re-authentication configured and > client supports re-authentication, you would always want re-authenticate. > So I wasn't sure why we need a config to opt in or out on the client-side. > > > connections.max.reauth.ms: We obviously need a broker-side config. Not > entirely sure about the semantics of the config that drives > re-authentication. In particular, I wasn't expecting that we would reject > tokens or tickets simply because they were too long-lived. Since tokens or > tickets are granted by a 3rd party authority, I am not sure if clients will > always have control over the lifetime. Do we need to support any more > scenarios than these: > > > A) reauth.ms=10,credential.lifetime.ms=10 : Broker sets > session.lifetime=10, > so this works. > > B) reauth.ms=10, credential.lifetime.ms=5 : Broker sets > session.lifetime=5, > so this works. > C) reauth.ms=10, credential.lifetime.ms=20 : Broker sets > session.lifetime=10. Client re-authenticates even though token was not > refreshed. Does this matter? > D) reauth.ms=Long.MAX_VALUE, credential.lifetime.ms=10: Broker sets > session.lifetime=10, client re-authenticates based on credential expiry. > E) reauth.ms=0 (default), credential.lifetime.ms=10 : Broker sets > session.lifetime=0, Broker doesn't terminate sessions, client doesn't > re-authenticate. We generate useful metrics. > F) reauth.ms=0 (default),no lifetime for credential (e.g. PLAIN): Broker > sets session.lifetime=0, Broker doesn't terminate sessions, client doesn't > re-authenticate > G) reauth.ms=10,no lifetime for credential (e.g. PLAIN) : Broker sets > session.lifetime=10. Client re-authenticates. > > I would have thought that D) is the typical scenario for OAuth/Kerberos to > respect token expiry time. G) would be typical scenario for PLAIN to force > re-authenication at regular intervals. A/B/C are useful to force > re-authentication in scenarios where you might check for credential > revocation in the server. And E/F are useful to disable re-authentication > and generate metrics (also the default behaviour useful during migration). > Have I missed something? > > > On Mon, Sep 17, 2018 at 4:27 PM, Ron Dagostino <rndg...@gmail.com> wrote: > > > Hi yet again, Rajini. I also just realized that if the client is to > learn > > the credential lifetime we wouldn't want to put special-case code in the > > Authenticator for GSSAPI and OAUTHBEARER; we would want to expose the > value > > generically, probably as a negotiated property on the SaslClient > instance. > > We might be talking about making the negotiated property key part of the > > public API. In other words, the SaslClient would be responsible for > > exposing the credential (i.e. token or ticket) lifetime at a well-known > > negotiated property name, such as "Credential.Lifetime" and putting a > Long > > value there if there is a lifetime. That well-klnown key (e.g. > > "Credential.Lifetime") would be part of the public API, right? > > > > Ron > > > > On Mon, Sep 17, 2018 at 11:03 AM Ron Dagostino <rndg...@gmail.com> > wrote: > > > > > Hi again, Rajini. After thinking about this a little while, it occurs > to > > > me that maybe the communication of max session lifetime should occur > via > > > SASL_HANDSHAKE after all. Here's why. The value communicated is the > max > > > session lifetime allowed, and the client can assume it is the the > session > > > lifetime for that particular session unless the particular SASL > mechanism > > > could result in a shorter session that would be obvious to the client > and > > > the server. In particular, for OAUTHBEARER, the session lifetime will > be > > > the token lifetime, which the client and server will both know. Is > > there a > > > reason to communicate this value back to the client when the client > > already > > > knows it? It's an extra network round-trip, and since the SASL > > round-trips > > > are defined by the spec I'm not certain adding an extra round-trip is > > > acceptable. Even if we decide we can add it, it helps with latency if > we > > > don't. > > > > > > Kerberos may be a bit different -- I don't know if the broker can learn > > > the session lifetime. If it can then the same thing holds -- both > client > > > and server will know the session lifetime and there is no reason to > > > communicate it back. If the server can't learn the lifetime then I > don't > > > think adding an extra SASL_AUTHENTICATE round trip is going to help, > > anyway. > > > > > > Also, by communicating the max session lifetime in the SASL_HANDSHAKE > > > response, both OAUTHBEARER and GSSAPI clients will be able to know > before > > > sending any SASL_AUTHENTICATE requests whether their credential > violates > > > the maximum. This allows a behaving client to give a good error > message. > > > A malicious client would ignore the value and send a longer-lived > > > credential, and then that would be rejected on the server side. > > > > > > I'm still good with ExpiringCredential not needing to be public. > > > > > > What do you think? > > > > > > Ron > > > > > > Ron > > > > > > Ron > > > > > > On Mon, Sep 17, 2018 at 9:13 AM Ron Dagostino <rndg...@gmail.com> > wrote: > > > > > >> Hi Rajini. I've updated the KIP to reflect the decision to fully > > support > > >> this for all SASL mechanisms and to not require the ExpiringCredential > > >> interface to be public. > > >> > > >> Ron > > >> > > >> On Mon, Sep 17, 2018 at 6:55 AM Ron Dagostino <rndg...@gmail.com> > > wrote: > > >> > > >>> Actually, I think the metric remains the same assuming the > > >>> authentication fails when the token lifetime is too long. Negative > > max > > >>> config on server counts what would have been killed because maybe > > client > > >>> re-auth is not turned on; positive max enables the kill, which is > > counted > > >>> by a second metric as proposed. The current proposal had already > > stated > > >>> that a non-zero value would disallow an authentication with a token > > that > > >>> has too large a lifetime, and that is still the case, I think. But > > let’s > > >>> make sure we are on the same page about all this. > > >>> > > >>> Ron > > >>> > > >>> > On Sep 17, 2018, at 6:42 AM, Ron Dagostino <rndg...@gmail.com> > > wrote: > > >>> > > > >>> > Hi Rajini. I see what you are saying. The background login > refresh > > >>> thread does have to factor into the decision for OAUTHBEARER because > > there > > >>> is no new token to re-authenticate with until that refresh succeeds > > >>> (similarly with Kerberos), but I think you are right that the > interface > > >>> doesn’t necessarily have to be public. The server does decide the > > time for > > >>> PLAIN and the SCRAM-related mechanisms under the current proposal, > but > > it > > >>> is done in SaslHandshakeResponse, and at that point the server won’t > > have > > >>> any token yet; making it happen via SaslAuthenticateRequest at the > > very end > > >>> does allow the server to know everything for all mechanisms. > > >>> > > > >>> > I see two potential issues to discuss. First, the server must > > >>> communicate a time that exceeds the (token or ticket) refresh period > > on the > > >>> client. Assuming it communicates the expiration time of the > > token/ticket, > > >>> that’s the best it can do. So I think that’ll be fine. > > >>> > > > >>> > The second issue is what happens if the server is configured to > > accept > > >>> a max value — say, an hour — and the token is good for longer. I > > assumed > > >>> that the client should not be allowed to authenticate in the first > > place > > >>> because it could then simply re-authenticate with the same token > after > > an > > >>> hour, which defeats the motivations for this KIP. So do we agree the > > max > > >>> token lifetime allowed will be enforced at authentication time? > > Assuming > > >>> so, then we need to discuss migration. > > >>> > > > >>> > The current proposal includes a metric that can be used to identify > > if > > >>> an OAUTHBEARER client is misconfigured — the number of connections > that > > >>> would have been killed (could be non-zero when the configured max is > a > > >>> negative number). Do we still want this type of an indication, and if > > so, > > >>> is it still done the same way — a negative number for max, but > instead > > of > > >>> counting the number of kills that would have been done it counts the > > number > > >>> of authentications that would have been failed? > > >>> > > > >>> > Ron > > >>> > > > >>> >> On Sep 17, 2018, at 6:06 AM, Rajini Sivaram < > > rajinisiva...@gmail.com> > > >>> wrote: > > >>> >> > > >>> >> Hi Ron, > > >>> >> > > >>> >> Thinking a bit more about other SASL mechanisms, I think the issue > > is > > >>> that > > >>> >> in the current proposal, clients decide the re-authentication > > period. > > >>> For > > >>> >> mechanisms like PLAIN or SCRAM, we would actually want server to > > >>> determine > > >>> >> the re-authentication interval. For example, the PLAIN or SCRAM > > >>> database > > >>> >> could specify the expiry time for each principal, or the broker > > could > > >>> be > > >>> >> configured with a single expiry time for all principals of that > > >>> mechanism. > > >>> >> For OAuth, brokers do know the expiry time. Haven't figured out > what > > >>> to do > > >>> >> with Kerberos, but in any case for broker-side connection > > termination > > >>> on > > >>> >> expiry, we need the broker to know/decide the expiry time. So I > > would > > >>> like > > >>> >> to suggest a slightly different approach to managing > > >>> re-authentications. > > >>> >> > > >>> >> 1) Instead of changing SASL_HANDSHAKE version number, we bump up > > >>> >> SASL_AUTHENTICATE version number. > > >>> >> 2) In the final SASL_AUTHENTICATE response, broker returns the > > expiry > > >>> time > > >>> >> of this authenticated session. This is the interval after which > > >>> broker will > > >>> >> terminate the connection If it wasn't re-authenticated. > > >>> >> 3) We no longer need the public interface change to add ` > > >>> >> org.apache.kafka.common.security.expiring.ExpiringCredential` for > > the > > >>> >> client-side. Instead we schedule the next re-authentication on the > > >>> client > > >>> >> based on the expiry time provided by the server (some time earlier > > >>> than the > > >>> >> expiry). > > >>> >> 4) If client uses SASL_AUTHENTICATE v0, broker will not return > > expiry > > >>> time. > > >>> >> The connection will be terminated if that feature is enabled (the > > >>> same code > > >>> >> path as client failing to re-authenticte on time). > > >>> >> > > >>> >> Thoughts? > > >>> >> > > >>> >>> On Sat, Sep 15, 2018 at 3:06 AM, Ron Dagostino < > rndg...@gmail.com> > > >>> wrote: > > >>> >>> > > >>> >>> Hi everyone. I've updated the KIP to reflect all discussion to > > date, > > >>> >>> including the decision to go with the low-level approach. This > > >>> latest > > >>> >>> version of the KIP includes the above "connections.max.reauth.ms > " > > >>> >>> proposal, > > >>> >>> which I know has not been discussed yet. It mentions new > metrics, > > >>> some of > > >>> >>> which may not have been discussed either (and names are missing > > from > > >>> some > > >>> >>> of them). Regardless, this new version is the closest yet to a > > >>> version > > >>> >>> that can be put to a vote next week. > > >>> >>> > > >>> >>> Ron > > >>> >>> > > >>> >>>> On Fri, Sep 14, 2018 at 8:59 PM Ron Dagostino < > rndg...@gmail.com> > > >>> wrote: > > >>> >>>> > > >>> >>>> Minor correction: I'm proposing "connections.max.reauth.ms" as > > the > > >>> >>>> broker-side configuration property, not " > > >>> >>>> connections.max.expired.credentials.ms". > > >>> >>>> > > >>> >>>> Ron > > >>> >>>> > > >>> >>>>> On Fri, Sep 14, 2018 at 8:40 PM Ron Dagostino < > rndg...@gmail.com > > > > > >>> wrote: > > >>> >>>>> > > >>> >>>>> Hi Rajini. I'm going to choose * > > >>> connections.max.expired.credentials.ms > > >>> >>>>> <http://connections.max.expired.credentials.ms>* as the option > > >>> name > > >>> >>>>> because it is consistent with the comment I made in the section > > >>> about > > >>> >>> how > > >>> >>>>> to get the client and server to agree on credential lifetime: > > >>> >>>>> > > >>> >>>>> "We could add a new API call so that clients could ask servers > > for > > >>> the > > >>> >>>>> lifetime they use, or we could extend the > > >>> SaslHandshakeRequest/Response > > >>> >>> API > > >>> >>>>> call to include that information in the server's response – the > > >>> client > > >>> >>>>> would then adopt that value" > > >>> >>>>> > > >>> >>>>> > > >>> >>>>> We set the config option value on the broker (with the " > > >>> >>>>> listener.name.mechanism." prefix), and we will return the > > >>> configured > > >>> >>>>> value in the SaslHandshakeResponse (requiring a wire format > > change > > >>> in > > >>> >>>>> addition to a version bump). The value (referred to as "X" > > below) > > >>> can > > >>> >>> be > > >>> >>>>> negative, zero, or positive and is to be interpreted as > follows: > > >>> >>>>> > > >>> >>>>> X = 0: Fully disable. The server will respond to > > >>> re-authentications, > > >>> >>> but > > >>> >>>>> it won't kill any connections due to expiration, and it won't > > track > > >>> >>> either > > >>> >>>>> of the 2 metrics mentioned below. > > >>> >>>>> > > >>> >>>>> Now, a couple of definitions for when X != 0: > > >>> >>>>> > > >>> >>>>> 1) We define the *maximum allowed expiration time* to be the > time > > >>> >>>>> determined by the point when (re-)authentication occurs plus > |X| > > >>> >>>>> milliseconds. > > >>> >>>>> 2) We define the *requested expiration time* to be the maximum > > >>> allowed > > >>> >>>>> expiration time except for the case of an OAuth bearer token, > in > > >>> which > > >>> >>> case > > >>> >>>>> it is the point at which the token expires. (Kerberos > presumably > > >>> also > > >>> >>>>> specifies a ticket lifetime, but frankly I am not in a position > > to > > >>> do > > >>> >>> any > > >>> >>>>> Kerberos-related coding in time for a 2.1.0 release and would > > >>> prefer to > > >>> >>>>> ignore this piece of information for this KIP -- would it be > > >>> acceptable > > >>> >>> to > > >>> >>>>> have someone else add it later?). > > >>> >>>>> > > >>> >>>>> Based on these definitions, we define the behavior as follows: > > >>> >>>>> > > >>> >>>>> *X > 0: Fully enable* > > >>> >>>>> A) The server will reject any authentication/re-authentication > > >>> attempt > > >>> >>>>> when the requested expiration time is after the maximum allowed > > >>> >>> expiration > > >>> >>>>> time (which could only happen with an OAuth bearer token, > > assuming > > >>> we > > >>> >>> skip > > >>> >>>>> dealing with Kerberos for now). > > >>> >>>>> B) The server will kill connections that are used beyond the > > >>> requested > > >>> >>>>> expiration time. > > >>> >>>>> C) A broker metric will be maintained that documents the number > > >>> >>>>> connections killed by the broker. This metric will be non-zero > > if > > >>> a > > >>> >>> client > > >>> >>>>> is connecting to the broker with re-authentication either > > >>> unavailable > > >>> >>> (i.e. > > >>> >>>>> an older client) or disabled. > > >>> >>>>> > > >>> >>>>> *X < 0: Partially enable* > > >>> >>>>> A) Same as above: the server will reject any > > >>> >>>>> authentication/re-authentication attempt when the requested > > >>> expiration > > >>> >>> time > > >>> >>>>> is after the maximum allowed expiration time (which could only > > >>> happen > > >>> >>> with > > >>> >>>>> an OAuth bearer token, assuming we skip dealing with Kerberos > for > > >>> now). > > >>> >>>>> B) The server will **not** kill connections that are used > beyond > > >>> the > > >>> >>>>> requested expiration time. > > >>> >>>>> C) A broker metric will be maintained that documents the number > > of > > >>> API > > >>> >>>>> requests unrelated to re-authentication that are made over a > > >>> connection > > >>> >>>>> beyond the requested expiration time. This metric will be > > helpful > > >>> for > > >>> >>>>> operators to ensure that all clients are properly upgraded and > > >>> >>>>> re-authenticating before fully enabling the server-side > > >>> >>>>> expired-connection-kill functionality (i.e. by changing the > value > > >>> from a > > >>> >>>>> negative number to a positive number): this metric will be zero > > >>> across > > >>> >>> all > > >>> >>>>> brokers when it is safe to fully enable the server-side > feature. > > >>> >>>>> > > >>> >>>>> Thoughts? > > >>> >>>>> > > >>> >>>>> Ron > > >>> >>>>> > > >>> >>>>> On Fri, Sep 14, 2018 at 11:59 AM Rajini Sivaram < > > >>> >>> rajinisiva...@gmail.com> > > >>> >>>>> wrote: > > >>> >>>>> > > >>> >>>>>> It will be good to shorten the config name. We have a config > > >>> named ` > > >>> >>>>>> connections.max.idle.ms`. We could add something like ` > > >>> >>>>>> connections.max.expired.credentials.ms`. As you said, it > would > > be > > >>> >>>>>> prefixed > > >>> >>>>>> with listener and SASL mechanism name. We should be able to > > >>> support > > >>> >>>>>> connection termination in future even with SSL, so perhaps we > > >>> don't > > >>> >>> want > > >>> >>>>>> the `sasl.server.` prefix (we just need to validate whether > the > > >>> config > > >>> >>> is > > >>> >>>>>> supported for the protocol). You can choose whether a boolean > > >>> flag or a > > >>> >>>>>> time interval makes more sense. > > >>> >>>>>> > > >>> >>>>>> For the client-side, the KIP explains how to make it work for > > >>> other > > >>> >>>>>> mechanisms and we can leave it that. Not convinced about > > >>> server-side > > >>> >>>>>> though. It seems to me that we probably would require API > > changes > > >>> to > > >>> >>> make > > >>> >>>>>> it work. Basically the proposed approach works only for > > >>> OAUTHBEARER. > > >>> >>>>>> Since > > >>> >>>>>> we have to bump up SaslHandshakeRequest version in this KIP, > it > > >>> will be > > >>> >>>>>> good to work out if we need to change the request or flow to > > make > > >>> this > > >>> >>>>>> work > > >>> >>>>>> for other mechanisms. I haven't figured out exactly what is > > >>> needed, but > > >>> >>>>>> will think about it and get back next week. In the meantime, > you > > >>> can > > >>> >>> get > > >>> >>>>>> the KIP up-to-date with the config, migration path etc. and > get > > it > > >>> >>> ready > > >>> >>>>>> to > > >>> >>>>>> start vote next week. > > >>> >>>>>> > > >>> >>>>>> > > >>> >>>>>> > > >>> >>>>>> > > >>> >>>>>> > > >>> >>>>>> On Fri, Sep 14, 2018 at 1:24 PM, Ron Dagostino < > > rndg...@gmail.com > > >>> > > > >>> >>>>>> wrote: > > >>> >>>>>> > > >>> >>>>>>> Hi Rajini. > > >>> >>>>>>> > > >>> >>>>>>> Agreed, we will bump the SaslHandshakeRequest API number (no > > wire > > >>> >>>>>> format > > >>> >>>>>>> change, of course). > > >>> >>>>>>> > > >>> >>>>>>> Agreed, responding to re-authentication will always be > enabled > > >>> on the > > >>> >>>>>>> broker since it is initiated by the client. > > >>> >>>>>>> > > >>> >>>>>>> Agreed, connection termination by the broker will be opt-in. > > I'm > > >>> >>>>>> thinking > > >>> >>>>>>> *sasl.server.disconnect.expired.credential.connections. > > enable*, > > >>> >>>>>> prefixed > > >>> >>>>>>> with listener and SASL mechanism name in lower-case so it can > > be > > >>> >>>>>> opt-in at > > >>> >>>>>>> the granularity of a SASL mechanism; for example, " > > >>> >>>>>>> listener.name.sasl_ssl.oauthbearer.sasl.server. > > >>> >>>>>>> disconnect.expired.credential.connections.enable=true > > >>> >>>>>>> ". It is long, so if you have a preference for a shorter > name > > >>> let me > > >>> >>>>>> know > > >>> >>>>>>> and we'll go with it instead. > > >>> >>>>>>> > > >>> >>>>>>> Agreed, additional metrics will be helpful. I'll add them to > > the > > >>> >>> KIP. > > >>> >>>>>>> > > >>> >>>>>>> <<<need to work out exactly how this works with all SASL > > >>> mechanisms > > >>> >>>>>>> <<<not sure we have the data required on the broker or a way > of > > >>> >>>>>>> <<< extending the [GSSAPI] mechanism > > >>> >>>>>>> Not sure I agree here. The paragraph I added to the KIP > > >>> describes > > >>> >>> how > > >>> >>>>>> I > > >>> >>>>>>> think this can be done. Given that description, do you still > > >>> feel > > >>> >>>>>> Kerberos > > >>> >>>>>>> will not be possible? If Kerberos presents a significant > > hurdle > > >>> >>> then I > > >>> >>>>>>> don't think that should prevent us from doing it for other > > >>> mechanisms > > >>> >>>>>> -- I > > >>> >>>>>>> would rather state that it isn't supported with GSSAPI due to > > >>> >>>>>> limitations > > >>> >>>>>>> in Kerberos itself than not have it for OAUTHBEARER, for > > example. > > >>> >>> And > > >>> >>>>>>> right now I don't think we need more than a high-level > > >>> description of > > >>> >>>>>> how > > >>> >>>>>>> this could be done. In other words, I think we have this > > >>> covered, > > >>> >>>>>> unless > > >>> >>>>>>> there is a fundamental problem due to Kerberos not making > data > > >>> (i.e. > > >>> >>>>>> ticket > > >>> >>>>>>> expiration) available on the server. I want to submit this > > KIP > > >>> for > > >>> >>> a > > >>> >>>>>> vote > > >>> >>>>>>> early next week in the hope of having it accepted by the > > Monday, > > >>> 9/24 > > >>> >>>>>>> deadline for the 2.1.0 release, and if that happens I > believe I > > >>> can > > >>> >>>>>> get a > > >>> >>>>>>> PR into really good shape soon thereafter (I'm working on it > > >>> now). > > >>> >>>>>>> > > >>> >>>>>>> Ron > > >>> >>>>>>> > > >>> >>>>>>> > > >>> >>>>>>> > > >>> >>>>>>> > > >>> >>>>>>> > > >>> >>>>>>> > > >>> >>>>>>> On Fri, Sep 14, 2018 at 5:57 AM Rajini Sivaram < > > >>> >>>>>> rajinisiva...@gmail.com> > > >>> >>>>>>> wrote: > > >>> >>>>>>> > > >>> >>>>>>>> Hi Ron, > > >>> >>>>>>>> > > >>> >>>>>>>> 1) Yes, we should update the version of > > `SaslHandshakeRequest`. > > >>> >>>>>> Clients > > >>> >>>>>>>> would attempt to re-authenticate only if broker's > > >>> >>>>>> SaslHandshakeRequest > > >>> >>>>>>>> version >=2. > > >>> >>>>>>>> 2) I think we should enable broker's re-authentication and > > >>> >>> connection > > >>> >>>>>>>> termination code regardless of client version. > > Re-authentication > > >>> >>>>>> could be > > >>> >>>>>>>> always enabled since it is triggered only by clients and > > broker > > >>> >>> could > > >>> >>>>>>> just > > >>> >>>>>>>> handle it. Connection termination should be opt-in, but > should > > >>> work > > >>> >>>>>> with > > >>> >>>>>>>> clients of any version. If turned on and clients are of an > > older > > >>> >>>>>> version, > > >>> >>>>>>>> this would simply force reconnection, which should be ok. > > >>> >>> Additional > > >>> >>>>>>>> metrics to monitor expired connections may be useful anyway. > > >>> >>>>>>>> > > >>> >>>>>>>> We also need to work out exactly how this works with all > SASL > > >>> >>>>>> mechanisms. > > >>> >>>>>>>> In particular, we have ensure that we can make it work with > > >>> >>> Kerberos > > >>> >>>>>>> since > > >>> >>>>>>>> we use the implementation from the JRE and I am not sure we > > have > > >>> >>> the > > >>> >>>>>> data > > >>> >>>>>>>> required on the broker or a way of extending the mechanism. > > >>> >>>>>>>> > > >>> >>>>>>>> Thoughts? > > >>> >>>>>>>> > > >>> >>>>>>>> On Thu, Sep 13, 2018 at 6:56 PM, Ron Dagostino < > > >>> rndg...@gmail.com> > > >>> >>>>>>> wrote: > > >>> >>>>>>>> > > >>> >>>>>>>>> Hi Rajini. I'm thinking about how we deal with migration. > > >>> For > > >>> >>>>>>> example, > > >>> >>>>>>>>> let's say we have an existing 2.0.0 cluster using > > >>> >>> SASL/OAUTHBEARER > > >>> >>>>>> and > > >>> >>>>>>> we > > >>> >>>>>>>>> want to use this feature. The desired end state is to have > > all > > >>> >>>>>> clients > > >>> >>>>>>>> and > > >>> >>>>>>>>> brokers migrated to a version that supports the feature > (call > > >>> it > > >>> >>>>>> 2.x) > > >>> >>>>>>>> with > > >>> >>>>>>>>> the feature turned on. We need to document the supported > > >>> path(s) > > >>> >>>>>> to > > >>> >>>>>>> this > > >>> >>>>>>>>> end state. > > >>> >>>>>>>>> > > >>> >>>>>>>>> Here are a couple of scenarios with implications: > > >>> >>>>>>>>> > > >>> >>>>>>>>> 1) *Migrate client to 2.x and turn the feature on but > broker > > is > > >>> >>>>>> still > > >>> >>>>>>> at > > >>> >>>>>>>>> 2.0.* In this case the client is going to get an error > when > > it > > >>> >>>>>> sends > > >>> >>>>>>> the > > >>> >>>>>>>>> SaslHandshakeRequest. We want to avoid this. It seems to > me > > >>> the > > >>> >>>>>>> client > > >>> >>>>>>>>> needs to know if the broker has been upgraded to the > > necessary > > >>> >>>>>> version > > >>> >>>>>>>>> before trying to re-authenticate, which makes me believe we > > >>> need > > >>> >>> to > > >>> >>>>>>> bump > > >>> >>>>>>>>> the API version number in 2.x and the client has to check > to > > >>> see > > >>> >>>>>> if the > > >>> >>>>>>>> max > > >>> >>>>>>>>> version the broker supports meets a minimum standard before > > >>> >>> trying > > >>> >>>>>> to > > >>> >>>>>>>>> re-authenticate. Do you agree? > > >>> >>>>>>>>> > > >>> >>>>>>>>> 2) *Migrate broker to 2.x and turn the feature on but > client > > is > > >>> >>>>>> still > > >>> >>>>>>> at > > >>> >>>>>>>>> 2.0*. In this case the broker is going to end up killing > > >>> >>>>>> connections > > >>> >>>>>>>>> periodically. Again, we want to avoid this. It's > tempting > > to > > >>> >>> say > > >>> >>>>>>>> "don't > > >>> >>>>>>>>> do this" but I wonder if we can require installations to > > >>> upgrade > > >>> >>>>>> all > > >>> >>>>>>>>> clients before turning the feature on at the brokers. > > >>> Certainly > > >>> >>>>>> we can > > >>> >>>>>>>> say > > >>> >>>>>>>>> "don't do this" for inter-broker clients -- in other words, > > we > > >>> >>> can > > >>> >>>>>> say > > >>> >>>>>>>> that > > >>> >>>>>>>>> all brokers in a cluster should be upgraded before the > > feature > > >>> is > > >>> >>>>>>> turned > > >>> >>>>>>>> on > > >>> >>>>>>>>> for any one of them -- but I don't know about our ability > to > > >>> say > > >>> >>>>>> it for > > >>> >>>>>>>>> non-broker clients. > > >>> >>>>>>>>> > > >>> >>>>>>>>> Now we consider the cases where both the brokers and the > > >>> clients > > >>> >>>>>> have > > >>> >>>>>>>> been > > >>> >>>>>>>>> upgraded to 2.x. When and where should the feature be > turned > > >>> on? > > >>> >>>>>> The > > >>> >>>>>>>>> inter-broker case is interesting because I don't think > think > > we > > >>> >>> can > > >>> >>>>>>>> require > > >>> >>>>>>>>> installations to restart every broker with a new config > where > > >>> the > > >>> >>>>>>> feature > > >>> >>>>>>>>> is turned on at the same time. Do you agree that we cannot > > >>> >>> require > > >>> >>>>>>> this > > >>> >>>>>>>>> and therefore must support installations restarting brokers > > >>> with > > >>> >>> a > > >>> >>>>>> new > > >>> >>>>>>>>> config at whatever pace they need -- which may be quite > slow > > >>> >>>>>> relative > > >>> >>>>>>> to > > >>> >>>>>>>>> token lifetimes? Assuming you do agree, then there is > going > > to > > >>> >>> be > > >>> >>>>>> the > > >>> >>>>>>>> case > > >>> >>>>>>>>> where some brokers are going to have the feature turned on > > and > > >>> >>> some > > >>> >>>>>>>> clients > > >>> >>>>>>>>> (definitely inter-broker clients at a minimum) are going to > > >>> have > > >>> >>>>>> the > > >>> >>>>>>>>> feature turned off. > > >>> >>>>>>>>> > > >>> >>>>>>>>> The above discussion assumes a single config on the broker > > side > > >>> >>>>>> that > > >>> >>>>>>>> turns > > >>> >>>>>>>>> on both the inter-broker client re-authentication feature > as > > >>> well > > >>> >>>>>> as > > >>> >>>>>>> the > > >>> >>>>>>>>> server-side expired-connection-kill feature, but now I'm > > >>> thinking > > >>> >>>>>> we > > >>> >>>>>>> need > > >>> >>>>>>>>> the ability to turn these features on independently, plus > > maybe > > >>> >>> we > > >>> >>>>>>> need a > > >>> >>>>>>>>> way to monitor the number of "active, expired" connections > to > > >>> the > > >>> >>>>>>> server > > >>> >>>>>>>> so > > >>> >>>>>>>>> that operators can be sure that all clients have been > > >>> >>>>>> upgraded/enabled > > >>> >>>>>>>>> before turning on the server-side expired-connection-kill > > >>> >>> feature. > > >>> >>>>>>>>> > > >>> >>>>>>>>> So the migration plan would be as follows: > > >>> >>>>>>>>> > > >>> >>>>>>>>> 1) Upgrade all brokers to 2.x. > > >>> >>>>>>>>> 2) After all brokers are upgraded, turn on > re-authentication > > >>> for > > >>> >>>>>> all > > >>> >>>>>>>>> brokers at whatever rate is desired -- just eventually, at > > some > > >>> >>>>>> point, > > >>> >>>>>>>> get > > >>> >>>>>>>>> the client-side feature turned on for all brokers so that > > >>> >>>>>> inter-broker > > >>> >>>>>>>>> connections are re-authenticating. > > >>> >>>>>>>>> 3) In parallel with (1) and (2) above, upgrade clients to > 2.x > > >>> and > > >>> >>>>>> turn > > >>> >>>>>>>>> their re-authentication feature on. Clients will check the > > API > > >>> >>>>>> version > > >>> >>>>>>>> and > > >>> >>>>>>>>> only re-authenticate to a broker that has also been > upgraded > > >>> >>> (note > > >>> >>>>>> that > > >>> >>>>>>>> the > > >>> >>>>>>>>> ability of a broker to respond to a re-authentication > cannot > > be > > >>> >>>>>> turned > > >>> >>>>>>>> off > > >>> >>>>>>>>> -- it is always on beginning with version 2.x, and it just > > sits > > >>> >>>>>> there > > >>> >>>>>>>> doing > > >>> >>>>>>>>> nothing if it isn't exercised by an enabled client) > > >>> >>>>>>>>> 4) After (1), (2), and (3) are complete, check the 2.x > broker > > >>> >>>>>> metrics > > >>> >>>>>>> to > > >>> >>>>>>>>> confirm that there are no "active, expired" connections. > > Once > > >>> >>> you > > >>> >>>>>> are > > >>> >>>>>>>>> satisfied that (1), (2), and (3) are indeed complete you > can > > >>> >>>>>> enable the > > >>> >>>>>>>>> server-side expired-connection-kill feature on each broker > > via > > >>> a > > >>> >>>>>>> restart > > >>> >>>>>>>> at > > >>> >>>>>>>>> whatever pace is desired. > > >>> >>>>>>>>> > > >>> >>>>>>>>> Comments? > > >>> >>>>>>>>> > > >>> >>>>>>>>> Ron > > >>> >>>>>>>>> > > >>> >>>>>>>>> > > >>> >>>>>>>>> On Wed, Sep 12, 2018 at 4:48 PM Ron Dagostino < > > >>> rndg...@gmail.com > > >>> >>>> > > >>> >>>>>>> wrote: > > >>> >>>>>>>>> > > >>> >>>>>>>>>> Ok, I am tempted to just say we go with the low-level > > approach > > >>> >>>>>> since > > >>> >>>>>>> it > > >>> >>>>>>>>> is > > >>> >>>>>>>>>> the quickest and seems to meet the clear requirements. We > > can > > >>> >>>>>> always > > >>> >>>>>>>>> adjust > > >>> >>>>>>>>>> later if we get to clarity on other requirements or we > > decide > > >>> >>> we > > >>> >>>>>> need > > >>> >>>>>>>> to > > >>> >>>>>>>>>> approach it differently for whatever reason. But in the > > >>> >>>>>> meantime, > > >>> >>>>>>>> before > > >>> >>>>>>>>>> fully committing to this decision, I would appreciate > > another > > >>> >>>>>>>> perspective > > >>> >>>>>>>>>> if someone has one. > > >>> >>>>>>>>>> > > >>> >>>>>>>>>> Ron > > >>> >>>>>>>>>> > > >>> >>>>>>>>>> On Wed, Sep 12, 2018 at 3:15 PM Rajini Sivaram < > > >>> >>>>>>>> rajinisiva...@gmail.com> > > >>> >>>>>>>>>> wrote: > > >>> >>>>>>>>>> > > >>> >>>>>>>>>>> Hi Ron, > > >>> >>>>>>>>>>> > > >>> >>>>>>>>>>> Yes, I would leave out retries from this KIP for now. In > > the > > >>> >>>>>> future, > > >>> >>>>>>>> if > > >>> >>>>>>>>>>> there is a requirement for supporting retries, we can > > >>> consider > > >>> >>>>>> it. I > > >>> >>>>>>>>> think > > >>> >>>>>>>>>>> we can support retries with either approach if we needed > > to, > > >>> >>>>>> but it > > >>> >>>>>>>>> would > > >>> >>>>>>>>>>> be better to do it along with other changes required to > > >>> >>> support > > >>> >>>>>>>>>>> authentication servers that are not highly available. > > >>> >>>>>>>>>>> > > >>> >>>>>>>>>>> For maintainability, I am biased, so it will be good to > get > > >>> >>> the > > >>> >>>>>>>>>>> perspective > > >>> >>>>>>>>>>> of others in the community :-) > > >>> >>>>>>>>>>> > > >>> >>>>>>>>>>> On Wed, Sep 12, 2018 at 7:47 PM, Ron Dagostino < > > >>> >>>>>> rndg...@gmail.com> > > >>> >>>>>>>>> wrote: > > >>> >>>>>>>>>>> > > >>> >>>>>>>>>>>> Hi Rajini. Here is some feedback/some things I thought > > of. > > >>> >>>>>>>>>>>> > > >>> >>>>>>>>>>>> First, I just realized that from a timing perspective > > that I > > >>> >>>>>> am > > >>> >>>>>>> not > > >>> >>>>>>>>> sure > > >>> >>>>>>>>>>>> retry is going to be necessary. The background login > > >>> >>> refresh > > >>> >>>>>>> thread > > >>> >>>>>>>>>>>> triggers re-authentication when it refreshes the > client's > > >>> >>>>>>>> credential. > > >>> >>>>>>>>>>> The > > >>> >>>>>>>>>>>> OAuth infrastructure has to be available in order for > this > > >>> >>>>>> refresh > > >>> >>>>>>>> to > > >>> >>>>>>>>>>>> succeed (the background thread repeatedly retries if it > > >>> >>> can't > > >>> >>>>>>>> refresh > > >>> >>>>>>>>>>> the > > >>> >>>>>>>>>>>> credential, and that retry loop handles any temporary > > >>> >>>>>> outage). If > > >>> >>>>>>>>>>> clients > > >>> >>>>>>>>>>>> are told to re-authenticate when the credential is > > refreshed > > >>> >>>>>> **and > > >>> >>>>>>>>> they > > >>> >>>>>>>>>>>> actually re-authenticate at that point** then it is > highly > > >>> >>>>>>> unlikely > > >>> >>>>>>>>> that > > >>> >>>>>>>>>>>> the OAuth infrastructure would fail within those > > intervening > > >>> >>>>>>>>>>> milliseconds. > > >>> >>>>>>>>>>>> So we don't need re-authentication retry in this KIP as > > long > > >>> >>>>>> as > > >>> >>>>>>>> retry > > >>> >>>>>>>>>>>> starts immediately. The low-level prototype as > currently > > >>> >>>>>> coded > > >>> >>>>>>>>> doesn't > > >>> >>>>>>>>>>>> actually start re-authentication until the connection is > > >>> >>>>>>>> subsequently > > >>> >>>>>>>>>>> used, > > >>> >>>>>>>>>>>> and that could take a while. But then again, if the > > >>> >>>>>> connection > > >>> >>>>>>>> isn't > > >>> >>>>>>>>>>> used > > >>> >>>>>>>>>>>> for some period of time, then the lost value of having > to > > >>> >>>>>> abandon > > >>> >>>>>>>> the > > >>> >>>>>>>>>>>> connection is lessened anyway. Plus, as you pointed > out, > > >>> >>>>>> OAuth > > >>> >>>>>>>>>>>> infrastructure is assumed to be highly available. > > >>> >>>>>>>>>>>> > > >>> >>>>>>>>>>>> Does this makes sense, and would you be willing to say > > that > > >>> >>>>>> retry > > >>> >>>>>>>>> isn't > > >>> >>>>>>>>>>> a > > >>> >>>>>>>>>>>> necessary requirement? I'm tempted but would like to > hear > > >>> >>>>>> your > > >>> >>>>>>>>>>> perspective > > >>> >>>>>>>>>>>> first. > > >>> >>>>>>>>>>>> > > >>> >>>>>>>>>>>> Interleaving/latency and maintainability (more than > lines > > of > > >>> >>>>>> code) > > >>> >>>>>>>> are > > >>> >>>>>>>>>>> the > > >>> >>>>>>>>>>>> two remaining areas of comparison. I did not add the > > >>> >>>>>>>> maintainability > > >>> >>>>>>>>>>> issue > > >>> >>>>>>>>>>>> to the KIP yet, but before I do I thought I would > address > > it > > >>> >>>>>> here > > >>> >>>>>>>>> first > > >>> >>>>>>>>>>> to > > >>> >>>>>>>>>>>> see if we can come to consensus on it because I'm not > > sure I > > >>> >>>>>> see > > >>> >>>>>>> the > > >>> >>>>>>>>>>>> high-level approach as being hard to maintain (yet -- I > > >>> >>> could > > >>> >>>>>> be > > >>> >>>>>>>>>>>> convinced/convince myself; we'll see). I do want to > make > > >>> >>>>>> sure we > > >>> >>>>>>>> are > > >>> >>>>>>>>> on > > >>> >>>>>>>>>>>> the same page about what is required to add > > >>> >>> re-authentication > > >>> >>>>>>>> support > > >>> >>>>>>>>> in > > >>> >>>>>>>>>>>> the high-level case. Granted, the amount is zero in the > > >>> >>>>>> low-level > > >>> >>>>>>>>> case, > > >>> >>>>>>>>>>>> and it doesn't get any better that that, but the amount > in > > >>> >>> the > > >>> >>>>>>>>>>> high-level > > >>> >>>>>>>>>>>> case is very low -- just a few lines of code. For > > example: > > >>> >>>>>>>>>>>> > > >>> >>>>>>>>>>>> KafkaAdminClient: > > >>> >>>>>>>>>>>> https://github.com/apache/kafka/pull/5582/commits/ > > >>> >>> 4fa70f38b9 > > >>> >>>>>>>>>>>> d33428ff98b64a3a2bd668f5f28c38#diff- > > >>> >>>>>>> 6869b8fccf6b098cbcb0676e8ceb26a7 > > >>> >>>>>>>>>>>> It is the same few lines of code for KafkaConsumer, > > >>> >>>>>> KafkaProducer, > > >>> >>>>>>>>>>>> WorkerGroupMember, and TransactionMarkerChannelManager > > >>> >>>>>>>>>>>> > > >>> >>>>>>>>>>>> The two synchronous I/O use cases are > > >>> >>>>>> ControllerChannelManager and > > >>> >>>>>>>>>>>> ReplicaFetcherBlockingSend (via ReplicaFetcherThread), > and > > >>> >>>>>> they > > >>> >>>>>>>>> require > > >>> >>>>>>>>>>> a > > >>> >>>>>>>>>>>> little bit more -- but not much. > > >>> >>>>>>>>>>>> > > >>> >>>>>>>>>>>> Thoughts? > > >>> >>>>>>>>>>>> > > >>> >>>>>>>>>>>> Ron > > >>> >>>>>>>>>>>> > > >>> >>>>>>>>>>>> On Wed, Sep 12, 2018 at 1:57 PM Ron Dagostino < > > >>> >>>>>> rndg...@gmail.com> > > >>> >>>>>>>>>>> wrote: > > >>> >>>>>>>>>>>> > > >>> >>>>>>>>>>>>> Thanks, Rajini. Before I digest/respond to that, > here's > > >>> >>> an > > >>> >>>>>>> update > > >>> >>>>>>>>>>> that I > > >>> >>>>>>>>>>>>> just completed. > > >>> >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>> I added a commit to the PR ( > > >>> >>>>>>>>>>> https://github.com/apache/kafka/pull/5582/) > > >>> >>>>>>>>>>>>> that implements server-side kill of expired OAUTHBEARER > > >>> >>>>>>>> connections. > > >>> >>>>>>>>>>> No > > >>> >>>>>>>>>>>>> tests yet since we still haven't settled on a final > > >>> >>> approach > > >>> >>>>>>>>>>> (low-level > > >>> >>>>>>>>>>>> vs. > > >>> >>>>>>>>>>>>> high-level). > > >>> >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>> I also updated the KIP to reflect the latest discussion > > >>> >>> and > > >>> >>>>>> PR > > >>> >>>>>>> as > > >>> >>>>>>>>>>>> follows: > > >>> >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>> - Include support for brokers killing connections as > > >>> >>>>>> part of > > >>> >>>>>>>> this > > >>> >>>>>>>>>>> KIP > > >>> >>>>>>>>>>>>> (rather than deferring it to a future KIP as was > > >>> >>>>>> originally > > >>> >>>>>>>>>>>> mentioned; the > > >>> >>>>>>>>>>>>> PR now includes it as mentioned -- it was very easy to > > >>> >>>>>> add) > > >>> >>>>>>>>>>>>> - Added metrics (they will mirror existing ones > related > > >>> >>>>>> to > > >>> >>>>>>>>>>>>> authentications; I have not added those to the PR) > > >>> >>>>>>>>>>>>> - Updated the implementation description to reflect > the > > >>> >>>>>>> current > > >>> >>>>>>>>>>> state > > >>> >>>>>>>>>>>>> of the PR, which is a high-level, one-size-fits-all > > >>> >>>>>> approach > > >>> >>>>>>>> (as > > >>> >>>>>>>>>>>> opposed to > > >>> >>>>>>>>>>>>> my initial, even-higher-level approach) > > >>> >>>>>>>>>>>>> - Added a "Rejected Alternative" for the first version > > >>> >>>>>> of the > > >>> >>>>>>>> PR, > > >>> >>>>>>>>>>>>> which injected requests directly into synchronous I/O > > >>> >>>>>>> clients' > > >>> >>>>>>>>>>> queues > > >>> >>>>>>>>>>>>> - Added a "Rejected Alternative" for the low-level > > >>> >>>>>> approach > > >>> >>>>>>> as > > >>> >>>>>>>>>>>>> suggested, but of course we have not formally decided > > >>> >>> to > > >>> >>>>>>> reject > > >>> >>>>>>>>>>> this > > >>> >>>>>>>>>>>>> approach or adopt the current PR implementation. > > >>> >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>> I'll think about where we stand some more before > > >>> >>> responding > > >>> >>>>>>> again. > > >>> >>>>>>>>>>>> Thanks > > >>> >>>>>>>>>>>>> for the above reply. > > >>> >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>> Ron > > >>> >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>> On Wed, Sep 12, 2018 at 1:36 PM Rajini Sivaram < > > >>> >>>>>>>>>>> rajinisiva...@gmail.com> > > >>> >>>>>>>>>>>>> wrote: > > >>> >>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>> Hi Ron, > > >>> >>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>> Thank you for summarising, I think it covers the > > >>> >>>>>> differences > > >>> >>>>>>>>> between > > >>> >>>>>>>>>>> the > > >>> >>>>>>>>>>>>>> two approaches well. > > >>> >>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>> A few minor points to answer the questions in there: > > >>> >>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>> 1) When re-authetication is initiated in the Selector > > >>> >>>>>> during > > >>> >>>>>>>>> poll(), > > >>> >>>>>>>>>>> we > > >>> >>>>>>>>>>>>>> can > > >>> >>>>>>>>>>>>>> move an idle channel to re-authentication state. It is > > >>> >>>>>> similar > > >>> >>>>>>> to > > >>> >>>>>>>>>>>>>> injecting > > >>> >>>>>>>>>>>>>> requests, but achieved by changing channel back to > > >>> >>>>>>> authenticating > > >>> >>>>>>>>>>> state. > > >>> >>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>> 3) To clarify why I think re-authentication should fit > > in > > >>> >>>>>> with > > >>> >>>>>>>> our > > >>> >>>>>>>>>>>>>> authentication design: My point was not about a > specific > > >>> >>>>>>>> connection > > >>> >>>>>>>>>>>> being > > >>> >>>>>>>>>>>>>> usable or not usable. It was about what happens at the > > >>> >>>>>> client > > >>> >>>>>>> API > > >>> >>>>>>>>>>> level. > > >>> >>>>>>>>>>>>>> Our client API (producer/consumer/admin client etc.) > > >>> >>>>>> currently > > >>> >>>>>>>>> assume > > >>> >>>>>>>>>>>> that > > >>> >>>>>>>>>>>>>> a single broker authentication failure is a fatal > error > > >>> >>>>>> that is > > >>> >>>>>>>>> never > > >>> >>>>>>>>>>>>>> retried because we assume that broker only ever fails > an > > >>> >>>>>>>>>>> authentication > > >>> >>>>>>>>>>>>>> request if credentials are invalid. If we ever decide > to > > >>> >>>>>>> support > > >>> >>>>>>>>>>> cases > > >>> >>>>>>>>>>>>>> where broker occasionally fails an authentication > > request > > >>> >>>>>> due > > >>> >>>>>>> to > > >>> >>>>>>>> a > > >>> >>>>>>>>>>>>>> transient failure, we need to do more around how we > > >>> >>> handle > > >>> >>>>>>>>>>>> authentication > > >>> >>>>>>>>>>>>>> failures in clients. We may decide that it is ok to > > close > > >>> >>>>>> the > > >>> >>>>>>>>>>> connection > > >>> >>>>>>>>>>>>>> for authentication and not for re-authentication as > you > > >>> >>>>>>>> mentioned, > > >>> >>>>>>>>>>> but > > >>> >>>>>>>>>>>> we > > >>> >>>>>>>>>>>>>> need to change the way this disconnection is handled > by > > >>> >>>>>>> clients. > > >>> >>>>>>>> So > > >>> >>>>>>>>>>> IMO, > > >>> >>>>>>>>>>>>>> we > > >>> >>>>>>>>>>>>>> should either add support for transient retriable > > >>> >>>>>>> authentication > > >>> >>>>>>>>>>>> failures > > >>> >>>>>>>>>>>>>> properly or not retry for any scenario. Personally, I > > >>> >>> don't > > >>> >>>>>>> think > > >>> >>>>>>>>> we > > >>> >>>>>>>>>>>> would > > >>> >>>>>>>>>>>>>> want to retry all authentication failures even if it > is > > a > > >>> >>>>>>>>>>>>>> re-authentication, I think we could (at some point in > > >>> >>>>>> future), > > >>> >>>>>>>>> allow > > >>> >>>>>>>>>>>>>> brokers to return an error code that indicates that it > > >>> >>> is a > > >>> >>>>>>>>> transient > > >>> >>>>>>>>>>>>>> broker-side failure rather than invalid credentials > and > > >>> >>>>>> handle > > >>> >>>>>>>> the > > >>> >>>>>>>>>>> error > > >>> >>>>>>>>>>>>>> differently. I see no reason at that point why we > > >>> >>> wouldn't > > >>> >>>>>>> handle > > >>> >>>>>>>>>>>>>> authentication and re-authentication in the same way. > > >>> >>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>> 4) As you said, the high-level approach would be > bigger > > >>> >>>>>> than > > >>> >>>>>>> the > > >>> >>>>>>>>>>>> low-level > > >>> >>>>>>>>>>>>>> approach in terms of LOC. But I wouldn't be too > > concerned > > >>> >>>>>> about > > >>> >>>>>>>>>>> lines of > > >>> >>>>>>>>>>>>>> code. My bigger concern was about modularity. Our > > >>> >>> security > > >>> >>>>>> code > > >>> >>>>>>>> is > > >>> >>>>>>>>>>>> already > > >>> >>>>>>>>>>>>>> complex, protocols like Kerberos and SSL that we use > > from > > >>> >>>>>> the > > >>> >>>>>>> JRE > > >>> >>>>>>>>>>> make > > >>> >>>>>>>>>>>>>> problem diagnosis hard. Async I/O makes the networking > > >>> >>> code > > >>> >>>>>>>>> complex. > > >>> >>>>>>>>>>> You > > >>> >>>>>>>>>>>>>> need to understand networking layer to work with the > > >>> >>>>>> security > > >>> >>>>>>>>> layer, > > >>> >>>>>>>>>>> but > > >>> >>>>>>>>>>>>>> the rest of the code base doesn't rely on knowledge of > > >>> >>>>>>>>>>> network/security > > >>> >>>>>>>>>>>>>> layers. My main concern about the high-level approach > is > > >>> >>>>>> that > > >>> >>>>>>> it > > >>> >>>>>>>>>>> spans > > >>> >>>>>>>>>>>>>> these boundaries, making it harder to maintain in the > > >>> >>> long > > >>> >>>>>> run. > > >>> >>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>> On Wed, Sep 12, 2018 at 10:23 AM, Stanislav Kozlovski > < > > >>> >>>>>>>>>>>>>> stanis...@confluent.io> wrote: > > >>> >>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>> Hi Ron, Rajini > > >>> >>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>> Thanks for summarizing the discussion so far, Ron! > > >>> >>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>> 1) How often do we have such long-lived connection > > >>> >>>>>> idleness > > >>> >>>>>>>> (e.g > > >>> >>>>>>>>>>> 5-10 > > >>> >>>>>>>>>>>>>>> minutes) in practice? > > >>> >>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>> 3) I agree that retries for re-authentication are > > >>> >>> useful. > > >>> >>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>> 4) The interleaving of requests sounds like a great > > >>> >>>>>> feature > > >>> >>>>>>> to > > >>> >>>>>>>>>>> have, > > >>> >>>>>>>>>>>> but > > >>> >>>>>>>>>>>>>>> the tradeoff against code complexity is a tough one. > I > > >>> >>>>>> would > > >>> >>>>>>>>>>>> personally > > >>> >>>>>>>>>>>>>> go > > >>> >>>>>>>>>>>>>>> with the simpler approach since you could always add > > >>> >>>>>>>> interleaving > > >>> >>>>>>>>>>> on > > >>> >>>>>>>>>>>>>> top if > > >>> >>>>>>>>>>>>>>> the community decides the latency should be better. > > >>> >>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>> Best, > > >>> >>>>>>>>>>>>>>> Stanislav > > >>> >>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>> On Tue, Sep 11, 2018 at 5:00 AM Ron Dagostino < > > >>> >>>>>>>> rndg...@gmail.com > > >>> >>>>>>>>>> > > >>> >>>>>>>>>>>>>> wrote: > > >>> >>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>>> Hi everyone. I've updated the PR to reflect the > > >>> >>> latest > > >>> >>>>>>>>>>> conclusions > > >>> >>>>>>>>>>>>>> from > > >>> >>>>>>>>>>>>>>>> this ongoing discussion. The KIP still needs the > > >>> >>>>>> suggested > > >>> >>>>>>>>>>>> updates; I > > >>> >>>>>>>>>>>>>>> will > > >>> >>>>>>>>>>>>>>>> do that later this week. II agree with Rajini that > > >>> >>>>>> some > > >>> >>>>>>>>>>> additional > > >>> >>>>>>>>>>>>>>>> feedback from the community at large would be very > > >>> >>>>>> helpful > > >>> >>>>>>> at > > >>> >>>>>>>>>>> this > > >>> >>>>>>>>>>>>>> point > > >>> >>>>>>>>>>>>>>> in > > >>> >>>>>>>>>>>>>>>> time. > > >>> >>>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>>> Here's where we stand. > > >>> >>>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>>> We have 2 separate implementations for > > >>> >>>>>> re-authenticating > > >>> >>>>>>>>> existing > > >>> >>>>>>>>>>>>>>>> connections -- a so-called "low-level" approach and > a > > >>> >>>>>>>>> (relatively > > >>> >>>>>>>>>>>>>>> speaking) > > >>> >>>>>>>>>>>>>>>> "high-level" approach -- and we agree on how they > > >>> >>>>>> should be > > >>> >>>>>>>>>>>> compared. > > >>> >>>>>>>>>>>>>>>> Specifically, Rajini has provided a mechanism that > > >>> >>>>>> works > > >>> >>>>>>> at a > > >>> >>>>>>>>>>>>>> relatively > > >>> >>>>>>>>>>>>>>>> low level in the stack by intercepting network sends > > >>> >>>>>> and > > >>> >>>>>>>>> queueing > > >>> >>>>>>>>>>>>>> them up > > >>> >>>>>>>>>>>>>>>> while re-authentication happens; then the queued > > >>> >>> sends > > >>> >>>>>> are > > >>> >>>>>>>> sent > > >>> >>>>>>>>>>>> after > > >>> >>>>>>>>>>>>>>>> re-authentication succeeds, or the connection is > > >>> >>>>>> closed if > > >>> >>>>>>>>>>>>>>>> re-authentication fails. See the prototype commit > at > > >>> >>>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>>> https://github.com/rajinisivaram/kafka/commit/ > > >>> >>>>>>> b9d711907ad843 > > >>> >>>>>>>>>>>>>>> c11d17e80d6743bfb1d4e3f3fd > > >>> >>>>>>>>>>>>>>>> . > > >>> >>>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>>> I have an implementation that works higher up in the > > >>> >>>>>> stack; > > >>> >>>>>>>> it > > >>> >>>>>>>>>>>> injects > > >>> >>>>>>>>>>>>>>>> authentication requests into the KafkaClient via a > > >>> >>> new > > >>> >>>>>>> method > > >>> >>>>>>>>>>> added > > >>> >>>>>>>>>>>> to > > >>> >>>>>>>>>>>>>>> that > > >>> >>>>>>>>>>>>>>>> interface, and the implementation (i.e. > > >>> >>> NetworkClient) > > >>> >>>>>>> sends > > >>> >>>>>>>>> them > > >>> >>>>>>>>>>>> when > > >>> >>>>>>>>>>>>>>>> poll() is called. The updated PR is available at > > >>> >>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/5582/. > > >>> >>>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>>> Here is how we think these two approaches should be > > >>> >>>>>>> compared. > > >>> >>>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>>> 1) *When re-authentication starts*. The low-level > > >>> >>>>>> approach > > >>> >>>>>>>>>>>> initiates > > >>> >>>>>>>>>>>>>>>> re-authentication only if/when the connection is > > >>> >>>>>> actually > > >>> >>>>>>>> used, > > >>> >>>>>>>>>>> so > > >>> >>>>>>>>>>>> it > > >>> >>>>>>>>>>>>>> may > > >>> >>>>>>>>>>>>>>>> start after the existing credential expires; the > > >>> >>>>>> current PR > > >>> >>>>>>>>>>>>>>> implementation > > >>> >>>>>>>>>>>>>>>> injects re-authentication requests into the existing > > >>> >>>>>> flow, > > >>> >>>>>>>> and > > >>> >>>>>>>>>>>>>>>> re-authentication starts immediately regardless of > > >>> >>>>>> whether > > >>> >>>>>>> or > > >>> >>>>>>>>> not > > >>> >>>>>>>>>>>> the > > >>> >>>>>>>>>>>>>>>> connection is being used for something else. Rajini > > >>> >>>>>>> believes > > >>> >>>>>>>>> the > > >>> >>>>>>>>>>>>>>> low-level > > >>> >>>>>>>>>>>>>>>> approach can be adjusted to re-authenticate idle > > >>> >>>>>>> connections > > >>> >>>>>>>>>>>> (Rajini, > > >>> >>>>>>>>>>>>>> I > > >>> >>>>>>>>>>>>>>>> don't see how this can be done without injecting > > >>> >>>>>> requests, > > >>> >>>>>>>>> which > > >>> >>>>>>>>>>> is > > >>> >>>>>>>>>>>>>> what > > >>> >>>>>>>>>>>>>>>> the high-level connection is doing, but I am > probably > > >>> >>>>>>> missing > > >>> >>>>>>>>>>>>>> something > > >>> >>>>>>>>>>>>>>> -- > > >>> >>>>>>>>>>>>>>>> no need to go into it unless it is easy to explain.) > > >>> >>>>>>>>>>>>>>>> > > >>> >>>>>>>>>>>>>>>> 2) *When requests not related to re-authentication > > >>> >>> are > > >>> >>>>>> able > > >>> >>>>>>>> to > > >>> >>>>>>>>>>> use > > >>> >>>>>>>>>>>> the > > >>> >>>>>>>>>>>>>>>> re-authenticating connection*. The low-level > > >>> >>> approach > > >>> >>>>>>>> finishes > > >>> >>>>>>>>>>>>>>>> re-authentication completely before allowing > anything > > >>> >>>>>> else > > >>> >>>>>>> to > > >>> >>>>>>>>>>>> traverse > > >>> >>>>>>>>>>>>>>> the > > >>> >>>>>>>>>>>>>>>> connection, and we agree that this is how the > > >>> >>> low-level > > >>> >>>>>>>>>>>> implementation > > >>> >>>>>>>>>>>>>>> must > > >>> >>>>>>>>>>>>>>>> work without significant work/changes. The current > > >>> >>> PR > > >>> >>>>>>>>>>>> implementation > > >>> >>>>>>>>>>>>>>>> interleaves re-authentication requests with the > > >>> >>>>>> existing > > >>> >>>>>>> flow > > >>> >>>>>>>>> in > > >>> >>>>>>>>>>> the > > >>> >>>>>>