Hi Rajini. The KIP is updated. Aside from a once-over to make sure it is all accurate, I think we need to confirm the metrics. The decision not to reject authentications that use tokens with too long a lifetime allowed the metrics to be simpler. I decided that in addition to tracking these metrics on the broker:
failed-reauthentication-{rate,total} and successful-reauthentication-{rate,total}, we simply need one more set of broker metrics to track the subset of clients that are not upgraded to v2.1.0 and are still using a V0 SaslAuthenticateRequest: failed-v0-authentication-{rate,total} and successful-v0-authentication-{rate,total}. See the Migration section of the KIP for details of how this would be used. I wonder if we need a broker metric documenting the number of "expired" sessions killed by the broker since it would be the same as successful-v0-authentication-total. I've eliminated that from the KIP for now. Thoughts? There is also a client-side metric for re-authentication latency tracking (still unnamed -- do you have a preference?). I think we're close to being able to put this KIP up for a vote. Ron On Mon, Sep 17, 2018 at 2:45 PM Ron Dagostino <rndg...@gmail.com> wrote: > <<<extra field alongside errors, not in the opaque body as mentioned in > the KIP > Right, that was a concern I had mentioned in a follow-up email. Agree it > should go alongside errors. > > <<<SCRAM for delegation token based authentication where credentials have > <<<an expiry time, so we do need to be able to set individual credential > <<<lifetimes, where the server knows the expiry time but client may not > Ah, ok, I was not aware that this case existed. Agreed, for consistency, > server will always send it back to the client. > > <<<I was trying to avoid this altogether [SaslClient negotiated property > for lifetime] > Since we now agree that the server will send it back, yes, there is no > need for this. > > <<<wasn't entirely clear about the purpose of > [sasl.login.refresh.reauthenticate.enable] > I think this might have been more useful when we weren't necessarily going > to support all SASL mechanisms and/or when the broker was not going to > advertise the fact that it supported re-authentication. 
You are correct, > now that we support it for all SASL mechanisms and we are bumping an API > version, I think it is okay to simply enable it wherever both the client > and server meet the required versions. > > <<<wasn't entirely clear about the purpose of [connections.max.reauth.ms] > <<<wasn't expecting that we would reject > <<<tokens or tickets simply because they were too long-lived > <<<tickets are granted by a 3rd party authority > <<<Client re-authenticates even though token was not > <<<refreshed. Does this matter? > I was going under the assumption that it would matter, but based on your > pushback I just realized that the same functionality can be implemented as > part of token validation if there is a desire to limit token lifetimes to a > certain max value (and the token validator has to be provided in production > anyway since all we provide out-of-the-box is the unsecured validator). So > I'm willing to abandon this check as part of re-authentication. > > I'll adjust the KIP accordingly a bit later. Thanks for the continued > feedback/discussion. > > Ron > > > > > On Mon, Sep 17, 2018 at 2:10 PM Rajini Sivaram <rajinisiva...@gmail.com> > wrote: > >> Hi Ron, >> >> >> *1) Is there a reason to communicate this value back to the client when >> the >> client already knows it? It's an extra network round-trip, and since the >> SASL round-trips are defined by the spec I'm not certain adding an extra >> round-trip is acceptable.* >> >> I wasn't suggesting an extra round-trip for propagating session lifetime. >> I >> was expecting session lifetime to be added to the last SASL_AUTHENTICATE >> response from the broker. Because SASL is a challenge-response mechanism, >> SaslServer knows when the response being sent is the last one and hence >> can >> send the session lifetime in the response (in the same way as we propagate >> errors). I was expecting this to be added as an extra field alongside >> errors, not in the opaque body as mentioned in the KIP. 
The opaque byte >> array strictly conforms to the SASL mechanism wire protocol and we want to >> keep it that way. >> >> As you have said, we don't need server to propagate session lifetime for >> OAUTHBEARER since client knows the token lifetime. But server also knows >> the credential lifetime and by having the server decide the lifetime, we >> can use the same code path for all mechanisms. If we only have a constant >> max lifetime in the server, for PLAIN and SCRAM we will end up having the >> same lifetime for all credentials with no ability to set actual expiry. We >> use SCRAM for delegation token based authentication where credentials have >> an expiry time, so we do need to be able to set individual credential >> lifetimes, where the server knows the expiry time but client may not. >> >> *2) I also just realized that if the client is to learn the credential >> lifetime we wouldn't want to put special-case code in the Authenticator >> for >> GSSAPI and OAUTHBEARER; we would want to expose the value generically, >> probably as a negotiated property on the SaslClient instance.* >> >> I was trying to avoid this altogether. Client doesn't need to know >> credential lifetime. Server asks the client to re-authenticate within its >> session lifetime. >> >> 3) From the KIP, I wasn't entirely clear about the purpose of the two >> configs: >> >> sasl.login.refresh.reauthenticate.enable: Do we need this? Client knows if >> broker version supports re-authentication based on the SASL_AUTHENTICATE >> version returned in ApiVersionsResponse. Client knows if broker is >> configured to enable re-authentication based on session lifetime returned >> in SaslAuthenticateResponse. If broker has re-authentication configured >> and >> client supports re-authentication, you would always want to re-authenticate. >> So I wasn't sure why we need a config to opt in or out on the client-side. >> >> >> connections.max.reauth.ms: We obviously need a broker-side config. 
Not >> entirely sure about the semantics of the config that drives >> re-authentication. In particular, I wasn't expecting that we would reject >> tokens or tickets simply because they were too long-lived. Since tokens or >> tickets are granted by a 3rd party authority, I am not sure if clients >> will >> always have control over the lifetime. Do we need to support any more >> scenarios than these: >> >> >> A) reauth.ms=10,credential.lifetime.ms=10 : Broker sets >> session.lifetime=10, >> so this works. >> >> B) reauth.ms=10, credential.lifetime.ms=5 : Broker sets >> session.lifetime=5, >> so this works. >> C) reauth.ms=10, credential.lifetime.ms=20 : Broker sets >> session.lifetime=10. Client re-authenticates even though token was not >> refreshed. Does this matter? >> D) reauth.ms=Long.MAX_VALUE, credential.lifetime.ms=10: Broker sets >> session.lifetime=10, client re-authenticates based on credential expiry. >> E) reauth.ms=0 (default), credential.lifetime.ms=10 : Broker sets >> session.lifetime=0, Broker doesn't terminate sessions, client doesn't >> re-authenticate. We generate useful metrics. >> F) reauth.ms=0 (default),no lifetime for credential (e.g. PLAIN): Broker >> sets session.lifetime=0, Broker doesn't terminate sessions, client doesn't >> re-authenticate >> G) reauth.ms=10,no lifetime for credential (e.g. PLAIN) : Broker sets >> session.lifetime=10. Client re-authenticates. >> >> I would have thought that D) is the typical scenario for OAuth/Kerberos to >> respect token expiry time. G) would be the typical scenario for PLAIN to force >> re-authentication at regular intervals. A/B/C are useful to force >> re-authentication in scenarios where you might check for credential >> revocation in the server. And E/F are useful to disable re-authentication >> and generate metrics (also the default behaviour useful during migration). >> Have I missed something? 
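All of the scenarios A-G above reduce to a single rule: the broker advertises the smaller of its configured maximum and the credential's remaining lifetime, with 0 meaning no enforcement. A minimal sketch of that rule, with illustrative names only (this is not the actual Kafka implementation):

```java
// Sketch of the broker-side session lifetime rule implied by scenarios A-G.
// All names are illustrative; this is not actual Kafka code.
public class SessionLifetimeRule {

    // maxReauthMs: the broker's connections.max.reauth.ms (0 disables enforcement).
    // credentialLifetimeMs: remaining credential lifetime, or null when the
    // mechanism carries no expiry (e.g. PLAIN).
    static long sessionLifetimeMs(long maxReauthMs, Long credentialLifetimeMs) {
        if (maxReauthMs == 0)
            return 0;                       // E/F: no termination, metrics only
        if (credentialLifetimeMs == null)
            return maxReauthMs;             // G: broker max applies directly
        return Math.min(maxReauthMs, credentialLifetimeMs);  // A-D
    }

    public static void main(String[] args) {
        System.out.println(sessionLifetimeMs(10, 20L));              // scenario C
        System.out.println(sessionLifetimeMs(Long.MAX_VALUE, 10L));  // scenario D
    }
}
```

Under this reading, D and G fall out of the same min() computation rather than needing separate code paths.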
>> >> >> On Mon, Sep 17, 2018 at 4:27 PM, Ron Dagostino <rndg...@gmail.com> wrote: >> >> > Hi yet again, Rajini. I also just realized that if the client is to >> learn >> > the credential lifetime we wouldn't want to put special-case code in the >> > Authenticator for GSSAPI and OAUTHBEARER; we would want to expose the >> value >> > generically, probably as a negotiated property on the SaslClient >> instance. >> > We might be talking about making the negotiated property key part of the >> > public API. In other words, the SaslClient would be responsible for >> > exposing the credential (i.e. token or ticket) lifetime at a well-known >> > negotiated property name, such as "Credential.Lifetime" and putting a >> Long >> > value there if there is a lifetime. That well-known key (e.g. >> > "Credential.Lifetime") would be part of the public API, right? >> > >> > Ron >> > >> > On Mon, Sep 17, 2018 at 11:03 AM Ron Dagostino <rndg...@gmail.com> >> wrote: >> > >> > > Hi again, Rajini. After thinking about this a little while, it >> occurs to >> > > me that maybe the communication of max session lifetime should occur >> via >> > > SASL_HANDSHAKE after all. Here's why. The value communicated is the >> max >> > > session lifetime allowed, and the client can assume it is the >> session >> > > lifetime for that particular session unless the particular SASL >> mechanism >> > > could result in a shorter session that would be obvious to the client >> and >> > > the server. In particular, for OAUTHBEARER, the session lifetime >> will be >> > > the token lifetime, which the client and server will both know. Is >> > there a >> > > reason to communicate this value back to the client when the client >> > already >> > > knows it? It's an extra network round-trip, and since the SASL >> > round-trips >> > > are defined by the spec I'm not certain adding an extra round-trip is >> > > acceptable. Even if we decide we can add it, it helps with latency >> if we >> > > don't. 
>> > > >> > > Kerberos may be a bit different -- I don't know if the broker can >> learn >> > > the session lifetime. If it can then the same thing holds -- both >> client >> > > and server will know the session lifetime and there is no reason to >> > > communicate it back. If the server can't learn the lifetime then I >> don't >> > > think adding an extra SASL_AUTHENTICATE round trip is going to help, >> > anyway. >> > > >> > > Also, by communicating the max session lifetime in the SASL_HANDSHAKE >> > > response, both OAUTHBEARER and GSSAPI clients will be able to know >> before >> > > sending any SASL_AUTHENTICATE requests whether their credential >> violates >> > > the maximum. This allows a behaving client to give a good error >> message. >> > > A malicious client would ignore the value and send a longer-lived >> > > credential, and then that would be rejected on the server side. >> > > >> > > I'm still good with ExpiringCredential not needing to be public. >> > > >> > > What do you think? >> > > >> > > Ron >> > > >> > > On Mon, Sep 17, 2018 at 9:13 AM Ron Dagostino <rndg...@gmail.com> >> wrote: >> > > >> > >> Hi Rajini. I've updated the KIP to reflect the decision to fully >> > support >> > >> this for all SASL mechanisms and to not require the >> ExpiringCredential >> > >> interface to be public. >> > >> >> > >> Ron >> > >> >> > >> On Mon, Sep 17, 2018 at 6:55 AM Ron Dagostino <rndg...@gmail.com> >> > wrote: >> > >> >> > >>> Actually, I think the metric remains the same assuming the >> > >>> authentication fails when the token lifetime is too long. Negative >> > max >> > >>> config on server counts what would have been killed because maybe >> > client >> > >>> re-auth is not turned on; positive max enables the kill, which is >> > counted >> > >>> by a second metric as proposed. 
The current proposal had already >> > stated >> > >>> that a non-zero value would disallow an authentication with a token >> > that >> > >>> has too large a lifetime, and that is still the case, I think. But >> > let’s >> > >>> make sure we are on the same page about all this. >> > >>> >> > >>> Ron >> > >>> >> > >>> > On Sep 17, 2018, at 6:42 AM, Ron Dagostino <rndg...@gmail.com> >> > wrote: >> > >>> > >> > >>> > Hi Rajini. I see what you are saying. The background login >> refresh >> > >>> thread does have to factor into the decision for OAUTHBEARER because >> > there >> > >>> is no new token to re-authenticate with until that refresh succeeds >> > >>> (similarly with Kerberos), but I think you are right that the >> interface >> > >>> doesn’t necessarily have to be public. The server does decide the >> > time for >> > >>> PLAIN and the SCRAM-related mechanisms under the current proposal, >> but >> > it >> > >>> is done in SaslHandshakeResponse, and at that point the server won’t >> > have >> > >>> any token yet; making it happen via SaslAuthenticateRequest at the >> > very end >> > >>> does allow the server to know everything for all mechanisms. >> > >>> > >> > >>> > I see two potential issues to discuss. First, the server must >> > >>> communicate a time that exceeds the (token or ticket) refresh period >> > on the >> > >>> client. Assuming it communicates the expiration time of the >> > token/ticket, >> > >>> that’s the best it can do. So I think that’ll be fine. >> > >>> > >> > >>> > The second issue is what happens if the server is configured to >> > accept >> > >>> a max value — say, an hour — and the token is good for longer. I >> > assumed >> > >>> that the client should not be allowed to authenticate in the first >> > place >> > >>> because it could then simply re-authenticate with the same token >> after >> > an >> > >>> hour, which defeats the motivations for this KIP. 
So do we agree >> the >> > max >> > >>> token lifetime allowed will be enforced at authentication time? >> > Assuming >> > >>> so, then we need to discuss migration. >> > >>> > >> > >>> > The current proposal includes a metric that can be used to >> identify >> > if >> > >>> an OAUTHBEARER client is misconfigured — the number of connections >> that >> > >>> would have been killed (could be non-zero when the configured max >> is a >> > >>> negative number). Do we still want this type of an indication, and >> if >> > so, >> > >>> is it still done the same way — a negative number for max, but >> instead >> > of >> > >>> counting the number of kills that would have been done it counts the >> > number >> > >>> of authentications that would have been failed? >> > >>> > >> > >>> > Ron >> > >>> > >> > >>> >> On Sep 17, 2018, at 6:06 AM, Rajini Sivaram < >> > rajinisiva...@gmail.com> >> > >>> wrote: >> > >>> >> >> > >>> >> Hi Ron, >> > >>> >> >> > >>> >> Thinking a bit more about other SASL mechanisms, I think the >> issue >> > is >> > >>> that >> > >>> >> in the current proposal, clients decide the re-authentication >> > period. >> > >>> For >> > >>> >> mechanisms like PLAIN or SCRAM, we would actually want server to >> > >>> determine >> > >>> >> the re-authentication interval. For example, the PLAIN or SCRAM >> > >>> database >> > >>> >> could specify the expiry time for each principal, or the broker >> > could >> > >>> be >> > >>> >> configured with a single expiry time for all principals of that >> > >>> mechanism. >> > >>> >> For OAuth, brokers do know the expiry time. Haven't figured out >> what >> > >>> to do >> > >>> >> with Kerberos, but in any case for broker-side connection >> > termination >> > >>> on >> > >>> >> expiry, we need the broker to know/decide the expiry time. So I >> > would >> > >>> like >> > >>> >> to suggest a slightly different approach to managing >> > >>> re-authentications. 
>> > >>> >> >> >> > >>> >> 1) Instead of changing SASL_HANDSHAKE version number, we bump up >> > >>> >> SASL_AUTHENTICATE version number. >> > >>> >> 2) In the final SASL_AUTHENTICATE response, broker returns the >> > expiry >> > >>> time >> > >>> >> of this authenticated session. This is the interval after which >> > >>> broker will >> > >>> >> terminate the connection if it wasn't re-authenticated. >> > >>> >> 3) We no longer need the public interface change to add ` >> > >>> >> org.apache.kafka.common.security.expiring.ExpiringCredential` for >> > the >> > >>> >> client-side. Instead we schedule the next re-authentication on >> the >> > >>> client >> > >>> >> based on the expiry time provided by the server (some time >> earlier >> > >>> than the >> > >>> >> expiry). >> > >>> >> 4) If client uses SASL_AUTHENTICATE v0, broker will not return >> > expiry >> > >>> time. >> > >>> >> The connection will be terminated if that feature is enabled (the >> > >>> same code >> > >>> >> path as client failing to re-authenticate on time). >> > >>> >> >> > >>> >> Thoughts? >> > >>> >> >> > >>> >>> On Sat, Sep 15, 2018 at 3:06 AM, Ron Dagostino < >> rndg...@gmail.com> >> > >>> wrote: >> > >>> >>> >> > >>> >>> Hi everyone. I've updated the KIP to reflect all discussion to >> > date, >> > >>> >>> including the decision to go with the low-level approach. This >> > >>> latest >> > >>> >>> version of the KIP includes the above " >> connections.max.reauth.ms" >> > >>> >>> proposal, >> > >>> >>> which I know has not been discussed yet. It mentions new >> metrics, >> > >>> some of >> > >>> >>> which may not have been discussed either (and names are missing >> > from >> > >>> some >> > >>> >>> of them). Regardless, this new version is the closest yet to a >> > >>> version >> > >>> >>> that can be put to a vote next week. 
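Point 3 of the four-step approach above (the client schedules its next re-authentication some time before the broker-supplied expiry) could be sketched roughly as follows; the 80% window factor and all names are illustrative assumptions, not the actual client implementation:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of client-side scheduling against the session lifetime the broker
// returns in the final SaslAuthenticateResponse. The 0.8 factor and names
// are illustrative assumptions only.
public class ReauthScheduler {

    // A lifetime <= 0 means the broker will not terminate the session,
    // so no re-authentication is scheduled (signalled here by -1).
    static long reauthDelayMs(long sessionLifetimeMs) {
        if (sessionLifetimeMs <= 0)
            return -1;
        return (long) (sessionLifetimeMs * 0.8); // leave headroom before expiry
    }

    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    void schedule(long sessionLifetimeMs, Runnable reauthenticate) {
        long delayMs = reauthDelayMs(sessionLifetimeMs);
        if (delayMs >= 0)
            executor.schedule(reauthenticate, delayMs, TimeUnit.MILLISECONDS);
    }
}
```

Scheduling ahead of the deadline matters because a client that re-authenticates exactly at expiry would race the broker's connection-termination code path described in point 4.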
>> > >>> >>> >> > >>> >>> Ron >> > >>> >>> >> > >>> >>>> On Fri, Sep 14, 2018 at 8:59 PM Ron Dagostino < >> rndg...@gmail.com> >> > >>> wrote: >> > >>> >>>> >> > >>> >>>> Minor correction: I'm proposing "connections.max.reauth.ms" as >> > the >> > >>> >>>> broker-side configuration property, not " >> > >>> >>>> connections.max.expired.credentials.ms". >> > >>> >>>> >> > >>> >>>> Ron >> > >>> >>>> >> > >>> >>>>> On Fri, Sep 14, 2018 at 8:40 PM Ron Dagostino < >> rndg...@gmail.com >> > > >> > >>> wrote: >> > >>> >>>>> >> > >>> >>>>> Hi Rajini. I'm going to choose * >> > >>> connections.max.expired.credentials.ms >> > >>> >>>>> <http://connections.max.expired.credentials.ms>* as the >> option >> > >>> name >> > >>> >>>>> because it is consistent with the comment I made in the >> section >> > >>> about >> > >>> >>> how >> > >>> >>>>> to get the client and server to agree on credential lifetime: >> > >>> >>>>> >> > >>> >>>>> "We could add a new API call so that clients could ask servers >> > for >> > >>> the >> > >>> >>>>> lifetime they use, or we could extend the >> > >>> SaslHandshakeRequest/Response >> > >>> >>> API >> > >>> >>>>> call to include that information in the server's response – >> the >> > >>> client >> > >>> >>>>> would then adopt that value" >> > >>> >>>>> >> > >>> >>>>> >> > >>> >>>>> We set the config option value on the broker (with the " >> > >>> >>>>> listener.name.mechanism." prefix), and we will return the >> > >>> configured >> > >>> >>>>> value in the SaslHandshakeResponse (requiring a wire format >> > change >> > >>> in >> > >>> >>>>> addition to a version bump). The value (referred to as "X" >> > below) >> > >>> can >> > >>> >>> be >> > >>> >>>>> negative, zero, or positive and is to be interpreted as >> follows: >> > >>> >>>>> >> > >>> >>>>> X = 0: Fully disable. 
The server will respond to >> > >>> re-authentications, >> > >>> >>> but >> > >>> >>>>> it won't kill any connections due to expiration, and it won't >> > track >> > >>> >>> either >> > >>> >>>>> of the 2 metrics mentioned below. >> > >>> >>>>> >> > >>> >>>>> Now, a couple of definitions for when X != 0: >> > >>> >>>>> >> > >>> >>>>> 1) We define the *maximum allowed expiration time* to be the >> time >> > >>> >>>>> determined by the point when (re-)authentication occurs plus >> |X| >> > >>> >>>>> milliseconds. >> > >>> >>>>> 2) We define the *requested expiration time* to be the maximum >> > >>> allowed >> > >>> >>>>> expiration time except for the case of an OAuth bearer token, >> in >> > >>> which >> > >>> >>> case >> > >>> >>>>> it is the point at which the token expires. (Kerberos >> presumably >> > >>> also >> > >>> >>>>> specifies a ticket lifetime, but frankly I am not in a >> position >> > to >> > >>> do >> > >>> >>> any >> > >>> >>>>> Kerberos-related coding in time for a 2.1.0 release and would >> > >>> prefer to >> > >>> >>>>> ignore this piece of information for this KIP -- would it be >> > >>> acceptable >> > >>> >>> to >> > >>> >>>>> have someone else add it later?). >> > >>> >>>>> >> > >>> >>>>> Based on these definitions, we define the behavior as follows: >> > >>> >>>>> >> > >>> >>>>> *X > 0: Fully enable* >> > >>> >>>>> A) The server will reject any authentication/re-authentication >> > >>> attempt >> > >>> >>>>> when the requested expiration time is after the maximum >> allowed >> > >>> >>> expiration >> > >>> >>>>> time (which could only happen with an OAuth bearer token, >> > assuming >> > >>> we >> > >>> >>> skip >> > >>> >>>>> dealing with Kerberos for now). >> > >>> >>>>> B) The server will kill connections that are used beyond the >> > >>> requested >> > >>> >>>>> expiration time. >> > >>> >>>>> C) A broker metric will be maintained that documents the >> number of >> > >>> >>>>> connections killed by the broker. 
This metric will be >> non-zero >> > if >> > >>> a >> > >>> >>> client >> > >>> >>>>> is connecting to the broker with re-authentication either >> > >>> unavailable >> > >>> >>> (i.e. >> > >>> >>>>> an older client) or disabled. >> > >>> >>>>> >> > >>> >>>>> *X < 0: Partially enable* >> > >>> >>>>> A) Same as above: the server will reject any >> > >>> >>>>> authentication/re-authentication attempt when the requested >> > >>> expiration >> > >>> >>> time >> > >>> >>>>> is after the maximum allowed expiration time (which could only >> > >>> happen >> > >>> >>> with >> > >>> >>>>> an OAuth bearer token, assuming we skip dealing with Kerberos >> for >> > >>> now). >> > >>> >>>>> B) The server will **not** kill connections that are used >> beyond >> > >>> the >> > >>> >>>>> requested expiration time. >> > >>> >>>>> C) A broker metric will be maintained that documents the >> number >> > of >> > >>> API >> > >>> >>>>> requests unrelated to re-authentication that are made over a >> > >>> connection >> > >>> >>>>> beyond the requested expiration time. This metric will be >> > helpful >> > >>> for >> > >>> >>>>> operators to ensure that all clients are properly upgraded and >> > >>> >>>>> re-authenticating before fully enabling the server-side >> > >>> >>>>> expired-connection-kill functionality (i.e. by changing the >> value >> > >>> from a >> > >>> >>>>> negative number to a positive number): this metric will be >> zero >> > >>> across >> > >>> >>> all >> > >>> >>>>> brokers when it is safe to fully enable the server-side >> feature. >> > >>> >>>>> >> > >>> >>>>> Thoughts? >> > >>> >>>>> >> > >>> >>>>> Ron >> > >>> >>>>> >> > >>> >>>>> On Fri, Sep 14, 2018 at 11:59 AM Rajini Sivaram < >> > >>> >>> rajinisiva...@gmail.com> >> > >>> >>>>> wrote: >> > >>> >>>>> >> > >>> >>>>>> It will be good to shorten the config name. We have a config >> > >>> named ` >> > >>> >>>>>> connections.max.idle.ms`. 
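The X > 0, X < 0, and X = 0 semantics spelled out a few messages above can be summarized in a short sketch; the class and method names here are hypothetical, chosen only to illustrate the proposal:

```java
// Illustrative summary of the proposed three-way interpretation of the
// broker-side value X; names are hypothetical, not a Kafka API.
public class MaxExpiredCredentialsConfig {

    enum Mode { DISABLED, PARTIALLY_ENABLED, FULLY_ENABLED }

    static Mode mode(long x) {
        if (x == 0) return Mode.DISABLED;       // respond to re-auth only, no metrics
        return x > 0 ? Mode.FULLY_ENABLED       // reject + kill + kill metric
                     : Mode.PARTIALLY_ENABLED;  // reject + "would kill" metric, no kill
    }

    // Reject any (re-)authentication whose requested expiration time exceeds
    // authentication time + |X| ms; applies whenever X != 0.
    static boolean rejectsTooLongLived(long x) {
        return x != 0;
    }

    // Actually kill connections used beyond the requested expiration
    // only in the fully enabled case.
    static boolean killsExpiredConnections(long x) {
        return x > 0;
    }
}
```

The negative-value mode exists purely so operators can watch the metric drop to zero across all brokers before flipping the sign and enforcing kills.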
We could add something like ` >> > >>> >>>>>> connections.max.expired.credentials.ms`. As you said, it >> would > be >> > >>> >>>>>> prefixed >> > >>> >>>>>> with listener and SASL mechanism name. We should be able to >> > >>> support >> > >>> >>>>>> connection termination in future even with SSL, so perhaps we >> > >>> don't >> > >>> >>> want >> > >>> >>>>>> the `sasl.server.` prefix (we just need to validate whether >> the >> > >>> config >> > >>> >>> is >> > >>> >>>>>> supported for the protocol). You can choose whether a boolean >> > >>> flag or a >> > >>> >>>>>> time interval makes more sense. >> > >>> >>>>>> >> > >>> >>>>>> For the client-side, the KIP explains how to make it work for >> > >>> other >> > >>> >>>>>> mechanisms and we can leave it at that. Not convinced about >> > >>> server-side >> > >>> >>>>>> though. It seems to me that we probably would require API >> > changes >> > >>> to >> > >>> >>> make >> > >>> >>>>>> it work. Basically the proposed approach works only for >> > >>> OAUTHBEARER. >> > >>> >>>>>> Since >> > >>> >>>>>> we have to bump up SaslHandshakeRequest version in this KIP, >> it >> > >>> will be >> > >>> >>>>>> good to work out if we need to change the request or flow to >> > make >> > >>> this >> > >>> >>>>>> work >> > >>> >>>>>> for other mechanisms. I haven't figured out exactly what is >> > >>> needed, but >> > >>> >>>>>> will think about it and get back next week. In the meantime, >> you >> > >>> can >> > >>> >>> get >> > >>> >>>>>> the KIP up-to-date with the config, migration path etc. and >> get >> > it >> > >>> >>> ready >> > >>> >>>>>> to >> > >>> >>>>>> start the vote next week. >> > >>> >>>>>> >> > >>> >>>>>> >> > >>> >>>>>> >> > >>> >>>>>> >> > >>> >>>>>> >> > >>> >>>>>> On Fri, Sep 14, 2018 at 1:24 PM, Ron Dagostino < >> > rndg...@gmail.com >> > >>> > >> > >>> >>>>>> wrote: >> > >>> >>>>>> >> > >>> >>>>>>> Hi Rajini. 
>> > >>> >>>>>>> >> > >>> >>>>>>> Agreed, we will bump the SaslHandshakeRequest API number (no >> > wire >> > >>> >>>>>> format >> > >>> >>>>>>> change, of course). >> > >>> >>>>>>> >> > >>> >>>>>>> Agreed, responding to re-authentication will always be >> enabled >> > >>> on the >> > >>> >>>>>>> broker since it is initiated by the client. >> > >>> >>>>>>> >> > >>> >>>>>>> Agreed, connection termination by the broker will be opt-in. >> > I'm >> > >>> >>>>>> thinking >> > >>> >>>>>>> *sasl.server.disconnect.expired.credential.connections. >> > enable*, >> > >>> >>>>>> prefixed >> > >>> >>>>>>> with listener and SASL mechanism name in lower-case so it >> can >> > be >> > >>> >>>>>> opt-in at >> > >>> >>>>>>> the granularity of a SASL mechanism; for example, " >> > >>> >>>>>>> listener.name.sasl_ssl.oauthbearer.sasl.server. >> > >>> >>>>>>> disconnect.expired.credential.connections.enable=true >> > >>> >>>>>>> ". It is long, so if you have a preference for a shorter >> name >> > >>> let me >> > >>> >>>>>> know >> > >>> >>>>>>> and we'll go with it instead. >> > >>> >>>>>>> >> > >>> >>>>>>> Agreed, additional metrics will be helpful. I'll add them >> to >> > the >> > >>> >>> KIP. >> > >>> >>>>>>> >> > >>> >>>>>>> <<<need to work out exactly how this works with all SASL >> > >>> mechanisms >> > >>> >>>>>>> <<<not sure we have the data required on the broker or a >> way of >> > >>> >>>>>>> <<< extending the [GSSAPI] mechanism >> > >>> >>>>>>> Not sure I agree here. The paragraph I added to the KIP >> > >>> describes >> > >>> >>> how >> > >>> >>>>>> I >> > >>> >>>>>>> think this can be done. Given that description, do you >> still >> > >>> feel >> > >>> >>>>>> Kerberos >> > >>> >>>>>>> will not be possible? 
If Kerberos presents a significant >> > hurdle >> > >>> >>> then I >> > >>> >>>>>>> don't think that should prevent us from doing it for other >> > >>> mechanisms >> > >>> >>>>>> -- I >> > >>> >>>>>>> would rather state that it isn't supported with GSSAPI due >> to >> > >>> >>>>>> limitations >> > >>> >>>>>>> in Kerberos itself than not have it for OAUTHBEARER, for >> > example. >> > >>> >>> And >> > >>> >>>>>>> right now I don't think we need more than a high-level >> > >>> description of >> > >>> >>>>>> how >> > >>> >>>>>>> this could be done. In other words, I think we have this >> > >>> covered, >> > >>> >>>>>> unless >> > >>> >>>>>>> there is a fundamental problem due to Kerberos not making >> data >> > >>> (i.e. >> > >>> >>>>>> ticket >> > >>> >>>>>>> expiration) available on the server. I want to submit this >> > KIP >> > >>> for >> > >>> >>> a >> > >>> >>>>>> vote >> > >>> >>>>>>> early next week in the hope of having it accepted by the >> > Monday, >> > >>> 9/24 >> > >>> >>>>>>> deadline for the 2.1.0 release, and if that happens I >> believe I >> > >>> can >> > >>> >>>>>> get a >> > >>> >>>>>>> PR into really good shape soon thereafter (I'm working on it >> > >>> now). >> > >>> >>>>>>> >> > >>> >>>>>>> Ron >> > >>> >>>>>>> >> > >>> >>>>>>> >> > >>> >>>>>>> >> > >>> >>>>>>> >> > >>> >>>>>>> >> > >>> >>>>>>> >> > >>> >>>>>>> On Fri, Sep 14, 2018 at 5:57 AM Rajini Sivaram < >> > >>> >>>>>> rajinisiva...@gmail.com> >> > >>> >>>>>>> wrote: >> > >>> >>>>>>> >> > >>> >>>>>>>> Hi Ron, >> > >>> >>>>>>>> >> > >>> >>>>>>>> 1) Yes, we should update the version of >> > `SaslHandshakeRequest`. >> > >>> >>>>>> Clients >> > >>> >>>>>>>> would attempt to re-authenticate only if broker's >> > >>> >>>>>> SaslHandshakeRequest >> > >>> >>>>>>>> version >=2. >> > >>> >>>>>>>> 2) I think we should enable broker's re-authentication and >> > >>> >>> connection >> > >>> >>>>>>>> termination code regardless of client version. 
Re-authentication >> > >>> >>>>>> could >> > >>> >>>>>>>> always be enabled since it is triggered only by clients and >> > broker >> > >>> >>> could >> > >>> >>>>>>> just >> > >>> >>>>>>>> handle it. Connection termination should be opt-in, but >> should >> > >>> work >> > >>> >>>>>> with >> > >>> >>>>>>>> clients of any version. If turned on and clients are of an >> > older >> > >>> >>>>>> version, >> > >>> >>>>>>>> this would simply force reconnection, which should be ok. >> > >>> >>> Additional >> > >>> >>>>>>>> metrics to monitor expired connections may be useful >> anyway. >> > >>> >>>>>>>> >> > >>> >>>>>>>> We also need to work out exactly how this works with all >> SASL >> > >>> >>>>>> mechanisms. >> > >>> >>>>>>>> In particular, we have to ensure that we can make it work with >> > >>> >>> Kerberos >> > >>> >>>>>>> since >> > >>> >>>>>>>> we use the implementation from the JRE and I am not sure we >> > have >> > >>> >>> the >> > >>> >>>>>> data >> > >>> >>>>>>>> required on the broker or a way of extending the mechanism. >> > >>> >>>>>>>> >> > >>> >>>>>>>> Thoughts? >> > >>> >>>>>>>> >> > >>> >>>>>>>> On Thu, Sep 13, 2018 at 6:56 PM, Ron Dagostino < >> > >>> rndg...@gmail.com> >> > >>> >>>>>>> wrote: >> > >>> >>>>>>>> >> > >>> >>>>>>>>> Hi Rajini. I'm thinking about how we deal with migration. >> > >>> For >> > >>> >>>>>>> example, >> > >>> >>>>>>>>> let's say we have an existing 2.0.0 cluster using >> > >>> >>> SASL/OAUTHBEARER >> > >>> >>>>>> and >> > >>> >>>>>>> we >> > >>> >>>>>>>>> want to use this feature. The desired end state is to >> have >> > all >> > >>> >>>>>> clients >> > >>> >>>>>>>> and >> > >>> >>>>>>>>> brokers migrated to a version that supports the feature >> (call >> > >>> it >> > >>> >>>>>> 2.x) >> > >>> >>>>>>>> with >> > >>> >>>>>>>>> the feature turned on. We need to document the supported >> > >>> path(s) >> > >>> >>>>>> to >> > >>> >>>>>>> this >> > >>> >>>>>>>>> end state. 
>> > >>> >>>>>>>>> >> > >>> >>>>>>>>> Here are a couple of scenarios with implications: >> > >>> >>>>>>>>> >> > >>> >>>>>>>>> 1) *Migrate client to 2.x and turn the feature on but >> broker >> > is >> > >>> >>>>>> still >> > >>> >>>>>>> at >> > >>> >>>>>>>>> 2.0.* In this case the client is going to get an error >> when >> > it >> > >>> >>>>>> sends >> > >>> >>>>>>> the >> > >>> >>>>>>>>> SaslHandshakeRequest. We want to avoid this. It seems >> to me >> > >>> the >> > >>> >>>>>>> client >> > >>> >>>>>>>>> needs to know if the broker has been upgraded to the >> > necessary >> > >>> >>>>>> version >> > >>> >>>>>>>>> before trying to re-authenticate, which makes me believe >> we >> > >>> need >> > >>> >>> to >> > >>> >>>>>>> bump >> > >>> >>>>>>>>> the API version number in 2.x and the client has to check >> to >> > >>> see >> > >>> >>>>>> if the >> > >>> >>>>>>>> max >> > >>> >>>>>>>>> version the broker supports meets a minimum standard >> before >> > >>> >>> trying >> > >>> >>>>>> to >> > >>> >>>>>>>>> re-authenticate. Do you agree? >> > >>> >>>>>>>>> >> > >>> >>>>>>>>> 2) *Migrate broker to 2.x and turn the feature on but >> client >> > is >> > >>> >>>>>> still >> > >>> >>>>>>> at >> > >>> >>>>>>>>> 2.0*. In this case the broker is going to end up killing >> > >>> >>>>>> connections >> > >>> >>>>>>>>> periodically. Again, we want to avoid this. It's >> tempting >> > to >> > >>> >>> say >> > >>> >>>>>>>> "don't >> > >>> >>>>>>>>> do this" but I wonder if we can require installations to >> > >>> upgrade >> > >>> >>>>>> all >> > >>> >>>>>>>>> clients before turning the feature on at the brokers. 
Certainly we can say "don't do this" for inter-broker clients -- in other words, we can say that all brokers in a cluster should be upgraded before the feature is turned on for any one of them -- but I don't know about our ability to say it for non-broker clients.

Now we consider the cases where both the brokers and the clients have been upgraded to 2.x. When and where should the feature be turned on? The inter-broker case is interesting because I don't think we can require installations to restart every broker with a new config where the feature is turned on at the same time. Do you agree that we cannot require this and therefore must support installations restarting brokers with a new config at whatever pace they need -- which may be quite slow relative to token lifetimes? Assuming you do agree, then there is going to be the case where some brokers are going to have the feature turned on and some clients (definitely inter-broker clients at a minimum) are going to have the feature turned off.
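Scenario (1) amounts to a simple client-side gate: only attempt re-authentication when the broker's advertised API version range is new enough. A minimal sketch of that check, where the `MIN_REAUTH_VERSION` constant and the method names are assumptions for illustration, not the KIP's actual code:

```java
// Hypothetical sketch of the client-side version gate from scenario (1).
// MIN_REAUTH_VERSION is an assumed value; the real minimum version that
// supports re-authentication is whatever the KIP finally specifies.
public class ReauthVersionGate {
    static final short MIN_REAUTH_VERSION = 1;

    /**
     * Returns true only if the max SaslAuthenticateRequest version the
     * broker advertises (e.g. via its ApiVersions response) meets the
     * minimum required for re-authentication.
     */
    public static boolean mayReauthenticate(short brokerMaxSaslAuthenticateVersion) {
        return brokerMaxSaslAuthenticateVersion >= MIN_REAUTH_VERSION;
    }
}
```

A 2.0 broker advertising only v0 would fail this check, so an upgraded client would quietly skip re-authentication instead of getting an error from the broker.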
The above discussion assumes a single config on the broker side that turns on both the inter-broker client re-authentication feature as well as the server-side expired-connection-kill feature, but now I'm thinking we need the ability to turn these features on independently, plus maybe we need a way to monitor the number of "active, expired" connections to the server so that operators can be sure that all clients have been upgraded/enabled before turning on the server-side expired-connection-kill feature.

So the migration plan would be as follows:

1) Upgrade all brokers to 2.x.
2) After all brokers are upgraded, turn on re-authentication for all brokers at whatever rate is desired -- just eventually, at some point, get the client-side feature turned on for all brokers so that inter-broker connections are re-authenticating.
3) In parallel with (1) and (2) above, upgrade clients to 2.x and turn their re-authentication feature on.
Clients will check the API version and only re-authenticate to a broker that has also been upgraded (note that the ability of a broker to respond to a re-authentication cannot be turned off -- it is always on beginning with version 2.x, and it just sits there doing nothing if it isn't exercised by an enabled client).
4) After (1), (2), and (3) are complete, check the 2.x broker metrics to confirm that there are no "active, expired" connections. Once you are satisfied that (1), (2), and (3) are indeed complete, you can enable the server-side expired-connection-kill feature on each broker via a restart at whatever pace is desired.

Comments?

Ron

On Wed, Sep 12, 2018 at 4:48 PM Ron Dagostino <rndg...@gmail.com> wrote:

Ok, I am tempted to just say we go with the low-level approach since it is the quickest and seems to meet the clear requirements. We can always adjust later if we get to clarity on other requirements or we decide we need to approach it differently for whatever reason.
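Step (4) of the migration plan above reduces to an operator-side check across all brokers. A toy illustration, where the per-broker snapshot of the "active, expired" connection count is an assumed shape (the actual metric name and how it is collected are not defined here):

```java
import java.util.Map;

// Toy illustration of migration step (4): only enable the server-side
// expired-connection-kill feature once every broker reports zero
// "active, expired" connections. The metric name/shape is assumed.
public class MigrationStep4 {
    public static boolean safeToEnableExpiredConnectionKill(Map<String, Long> activeExpiredByBroker) {
        // Safe only when no broker still has clients riding on expired sessions.
        return activeExpiredByBroker.values().stream().allMatch(count -> count == 0L);
    }
}
```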
But in the meantime, before fully committing to this decision, I would appreciate another perspective if someone has one.

Ron

On Wed, Sep 12, 2018 at 3:15 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

Yes, I would leave out retries from this KIP for now. In the future, if there is a requirement for supporting retries, we can consider it. I think we can support retries with either approach if we needed to, but it would be better to do it along with other changes required to support authentication servers that are not highly available.

For maintainability, I am biased, so it will be good to get the perspective of others in the community :-)

On Wed, Sep 12, 2018 at 7:47 PM, Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini. Here is some feedback/some things I thought of.

First, I just realized that from a timing perspective I am not sure retry is going to be necessary. The background login refresh thread triggers re-authentication when it refreshes the client's credential.
The OAuth infrastructure has to be available in order for this refresh to succeed (the background thread repeatedly retries if it can't refresh the credential, and that retry loop handles any temporary outage). If clients are told to re-authenticate when the credential is refreshed **and they actually re-authenticate at that point**, then it is highly unlikely that the OAuth infrastructure would fail within those intervening milliseconds. So we don't need re-authentication retry in this KIP as long as re-authentication starts immediately. The low-level prototype as currently coded doesn't actually start re-authentication until the connection is subsequently used, and that could take a while. But then again, if the connection isn't used for some period of time, then the lost value of having to abandon the connection is lessened anyway. Plus, as you pointed out, OAuth infrastructure is assumed to be highly available.

Does this make sense, and would you be willing to say that retry isn't a necessary requirement?
I'm tempted but would like to hear your perspective first.

Interleaving/latency and maintainability (more than lines of code) are the two remaining areas of comparison. I did not add the maintainability issue to the KIP yet, but before I do I thought I would address it here first to see if we can come to consensus on it, because I'm not sure I see the high-level approach as being hard to maintain (yet -- I could be convinced/convince myself; we'll see). I do want to make sure we are on the same page about what is required to add re-authentication support in the high-level case. Granted, the amount is zero in the low-level case, and it doesn't get any better than that, but the amount in the high-level case is very low -- just a few lines of code.
For example:

KafkaAdminClient:
https://github.com/apache/kafka/pull/5582/commits/4fa70f38b9d33428ff98b64a3a2bd668f5f28c38#diff-6869b8fccf6b098cbcb0676e8ceb26a7

It is the same few lines of code for KafkaConsumer, KafkaProducer, WorkerGroupMember, and TransactionMarkerChannelManager.

The two synchronous I/O use cases are ControllerChannelManager and ReplicaFetcherBlockingSend (via ReplicaFetcherThread), and they require a little bit more -- but not much.

Thoughts?

Ron

On Wed, Sep 12, 2018 at 1:57 PM Ron Dagostino <rndg...@gmail.com> wrote:

Thanks, Rajini. Before I digest/respond to that, here's an update that I just completed.

I added a commit to the PR (https://github.com/apache/kafka/pull/5582/) that implements server-side kill of expired OAUTHBEARER connections. No tests yet since we still haven't settled on a final approach (low-level vs. high-level).
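The server-side kill just mentioned boils down to the broker remembering a session expiry time at (re-)authentication and closing the channel once that time passes, provided the opt-in feature is enabled. A minimal sketch under assumed names (not the PR's actual code):

```java
// Minimal sketch (assumed names, not the actual PR code) of the
// server-side expired-connection kill: the broker records when the
// session's credential expires and closes the channel once that
// deadline passes, but only if the opt-in kill feature is enabled.
public class ExpiredConnectionKill {
    public static boolean shouldCloseConnection(long sessionExpiresAtMs,
                                                long nowMs,
                                                boolean killFeatureEnabled) {
        return killFeatureEnabled && nowMs >= sessionExpiresAtMs;
    }
}
```

With the feature disabled, the broker would merely observe expired-but-active connections (e.g. via a metric), which is what makes the staged migration discussed earlier possible.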
I also updated the KIP to reflect the latest discussion and PR as follows:

- Include support for brokers killing connections as part of this KIP (rather than deferring it to a future KIP as was originally mentioned; the PR now includes it as mentioned -- it was very easy to add)
- Added metrics (they will mirror existing ones related to authentications; I have not added those to the PR)
- Updated the implementation description to reflect the current state of the PR, which is a high-level, one-size-fits-all approach (as opposed to my initial, even-higher-level approach)
- Added a "Rejected Alternative" for the first version of the PR, which injected requests directly into synchronous I/O clients' queues
- Added a "Rejected Alternative" for the low-level approach as suggested, but of course we have not formally decided to reject this approach or adopt the current PR implementation.

I'll think about where we stand some more before responding again. Thanks for the above reply.
Ron

On Wed, Sep 12, 2018 at 1:36 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

Thank you for summarising, I think it covers the differences between the two approaches well.

A few minor points to answer the questions in there:

1) When re-authentication is initiated in the Selector during poll(), we can move an idle channel to re-authentication state. It is similar to injecting requests, but achieved by changing the channel back to authenticating state.

3) To clarify why I think re-authentication should fit in with our authentication design: My point was not about a specific connection being usable or not usable. It was about what happens at the client API level. Our client API (producer/consumer/admin client etc.) currently assumes that a single broker authentication failure is a fatal error that is never retried, because we assume that the broker only ever fails an authentication request if credentials are invalid. If we ever decide to support cases where the broker occasionally fails an authentication request due to a transient failure, we need to do more around how we handle authentication failures in clients. We may decide that it is ok to close the connection for authentication and not for re-authentication as you mentioned, but we need to change the way this disconnection is handled by clients. So IMO, we should either add support for transient retriable authentication failures properly or not retry for any scenario.
Personally, I don't think we would want to retry all authentication failures even if it is a re-authentication. I think we could (at some point in future) allow brokers to return an error code that indicates that it is a transient broker-side failure rather than invalid credentials, and handle the error differently. I see no reason at that point why we wouldn't handle authentication and re-authentication in the same way.

4) As you said, the high-level approach would be bigger than the low-level approach in terms of LOC. But I wouldn't be too concerned about lines of code. My bigger concern was about modularity. Our security code is already complex; protocols like Kerberos and SSL that we use from the JRE make problem diagnosis hard. Async I/O makes the networking code complex. You need to understand the networking layer to work with the security layer, but the rest of the code base doesn't rely on knowledge of the network/security layers.
My main concern about the high-level approach is that it spans these boundaries, making it harder to maintain in the long run.

On Wed, Sep 12, 2018 at 10:23 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:

Hi Ron, Rajini

Thanks for summarizing the discussion so far, Ron!

1) How often do we have such long-lived connection idleness (e.g. 5-10 minutes) in practice?

3) I agree that retries for re-authentication are useful.

4) The interleaving of requests sounds like a great feature to have, but the tradeoff against code complexity is a tough one. I would personally go with the simpler approach since you could always add interleaving on top if the community decides the latency should be better.

Best,
Stanislav

On Tue, Sep 11, 2018 at 5:00 AM Ron Dagostino <rndg...@gmail.com> wrote:

Hi everyone.
I've updated the PR to reflect the latest conclusions from this ongoing discussion. The KIP still needs the suggested updates; I will do that later this week. I agree with Rajini that some additional feedback from the community at large would be very helpful at this point in time.

Here's where we stand.

We have 2 separate implementations for re-authenticating existing connections -- a so-called "low-level" approach and a (relatively speaking) "high-level" approach -- and we agree on how they should be compared. Specifically, Rajini has provided a mechanism that works at a relatively low level in the stack by intercepting network sends and queueing them up while re-authentication happens; then the queued sends are sent after re-authentication succeeds, or the connection is closed if re-authentication fails.
See the prototype commit at
https://github.com/rajinisivaram/kafka/commit/b9d711907ad843c11d17e80d6743bfb1d4e3f3fd

I have an implementation that works higher up in the stack; it injects authentication requests into the KafkaClient via a new method added to that interface, and the implementation (i.e. NetworkClient) sends them when poll() is called. The updated PR is available at https://github.com/apache/kafka/pull/5582/.

Here is how we think these two approaches should be compared.

1) *When re-authentication starts*. The low-level approach initiates re-authentication only if/when the connection is actually used, so it may start after the existing credential expires; the current PR implementation injects re-authentication requests into the existing flow, and re-authentication starts immediately regardless of whether or not the connection is being used for something else.
Rajini believes the low-level approach can be adjusted to re-authenticate idle connections (Rajini, I don't see how this can be done without injecting requests, which is what the high-level approach is doing, but I am probably missing something -- no need to go into it unless it is easy to explain.)

2) *When requests not related to re-authentication are able to use the re-authenticating connection*. The low-level approach finishes re-authentication completely before allowing anything else to traverse
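The queue-then-flush behaviour of the low-level approach described above -- park outgoing sends while the channel re-authenticates, flush them in order on success, drop them and close on failure -- can be sketched roughly like this (illustrative names only, not Rajini's actual prototype):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Rough sketch (assumed names, not the actual prototype) of the low-level
// idea: while a channel is re-authenticating, outgoing sends are parked;
// on success they are flushed in order, on failure the connection would
// be closed and the parked sends abandoned.
public class ReauthChannel {
    private final Queue<String> pendingSends = new ArrayDeque<>();
    private final List<String> wire = new ArrayList<>(); // stands in for the socket
    private boolean reauthenticating = true; // this sketch starts mid re-authentication

    public void send(String payload) {
        if (reauthenticating)
            pendingSends.add(payload); // park until re-authentication completes
        else
            wire.add(payload);
    }

    public void reauthenticationFinished(boolean success) {
        reauthenticating = false;
        if (success) {
            while (!pendingSends.isEmpty())
                wire.add(pendingSends.poll()); // flush queued sends in order
        } else {
            pendingSends.clear(); // connection is closed; queued sends are dropped
        }
    }

    public List<String> sent() {
        return wire;
    }
}
```

This is also where the interleaving trade-off shows up: nothing in `pendingSends` reaches the wire until re-authentication completes, which is exactly the latency cost discussed earlier.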