Hi again, Rajini. Would we ever want the max session time to differ across SASL mechanisms? Now that we are supporting all SASL mechanisms via this KIP, I'm wondering whether we still need to prefix this config with the "[listener].[mechanism]." prefix. I've kept the prefix in the KIP for now, but it would be easier to set it once for all mechanisms, and I don't see that as being a problem. Let me know what you think.
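For illustration only, here is roughly what the two options would look like in a broker's server.properties (the listener and mechanism names are examples, and the values are arbitrary; this is a sketch of the question, not final syntax from the KIP):

    # Option 1: per-listener, per-mechanism (as the KIP currently reads)
    listener.name.sasl_ssl.oauthbearer.connections.max.reauth.ms=3600000
    listener.name.sasl_ssl.scram-sha-256.connections.max.reauth.ms=86400000

    # Option 2: a single setting applied to all mechanisms
    connections.max.reauth.ms=3600000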
Ron

On Mon, Sep 17, 2018 at 9:51 PM Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini. The KIP is updated. Aside from a once-over to make sure it is all accurate, I think we need to confirm the metrics. The decision not to reject authentications that use tokens with too long a lifetime allowed the metrics to be simpler. I decided that in addition to tracking these metrics on the broker:

failed-reauthentication-{rate,total} and successful-reauthentication-{rate,total}

we simply need one more set of broker metrics to track the subset of clients that are not upgraded to v2.1.0 and are still using a V0 SaslAuthenticateRequest:

failed-v0-authentication-{rate,total} and successful-v0-authentication-{rate,total}

See the Migration section of the KIP for details of how this would be used.

I wonder if we need a broker metric documenting the number of "expired" sessions killed by the broker, since it would be the same as successful-v0-authentication-total. I've eliminated that from the KIP for now. Thoughts?

There is also a client-side metric for re-authentication latency tracking (still unnamed -- do you have a preference?)

I think we're close to being able to put this KIP up for a vote.

Ron

On Mon, Sep 17, 2018 at 2:45 PM Ron Dagostino <rndg...@gmail.com> wrote:

<<< extra field alongside errors, not in the opaque body as mentioned in the KIP
Right, that was a concern I had mentioned in a follow-up email. Agree it should go alongside errors.

<<< SCRAM for delegation token based authentication where credentials have an expiry time, so we do need to be able to set individual credential lifetimes, where the server knows the expiry time but client may not
Ah, OK, I was not aware that this case existed. Agreed, for consistency, the server will always send it back to the client.

<<< I was trying to avoid this altogether [SaslClient negotiated property for lifetime]
Since we now agree that the server will send it back, yes, there is no need for this.

<<< wasn't entirely clear about the purpose of [sasl.login.refresh.reauthenticate.enable]
I think this might have been more useful when we weren't necessarily going to support all SASL mechanisms and/or when the broker was not going to advertise the fact that it supported re-authentication. You are correct: now that we support it for all SASL mechanisms and we are bumping an API version, I think it is okay to simply enable it wherever both the client and server meet the required versions.

<<< wasn't entirely clear about the purpose of [connections.max.reauth.ms]
<<< wasn't expecting that we would reject tokens or tickets simply because they were too long-lived
<<< tickets are granted by a 3rd party authority
<<< Client re-authenticates even though token was not refreshed. Does this matter?
I was going under the assumption that it would matter, but based on your pushback I just realized that the same functionality can be implemented as part of token validation if there is a desire to limit token lifetimes to a certain max value (and the token validator has to be provided in production anyway, since all we provide out-of-the-box is the unsecured validator). So I'm willing to abandon this check as part of re-authentication.

I'll adjust the KIP accordingly a bit later. Thanks for the continued feedback/discussion.

Ron
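As a rough illustration of the token-validation alternative Ron describes above -- enforcing a maximum token lifetime at validation time rather than during re-authentication -- a deployment's custom validator could do something like the following. This is a minimal sketch; the class and method names are made up for illustration and are not part of Kafka's API.

    // Sketch only: enforce a maximum token lifetime inside a custom OAUTHBEARER
    // token validator rather than at (re-)authentication time.
    final class MaxTokenLifetimeCheck {
        private final long maxLifetimeMs;   // whatever config the deployment chooses

        MaxTokenLifetimeCheck(long maxLifetimeMs) {
            this.maxLifetimeMs = maxLifetimeMs;
        }

        // Throws if the token's declared lifetime exceeds the allowed maximum.
        void check(long tokenStartTimeMs, long tokenExpiryTimeMs) {
            long lifetimeMs = tokenExpiryTimeMs - tokenStartTimeMs;
            if (maxLifetimeMs > 0 && lifetimeMs > maxLifetimeMs)
                throw new IllegalArgumentException("Token lifetime " + lifetimeMs
                        + " ms exceeds the allowed maximum of " + maxLifetimeMs + " ms");
        }
    }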
On Mon, Sep 17, 2018 at 2:10 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

*1) Is there a reason to communicate this value back to the client when the client already knows it? It's an extra network round-trip, and since the SASL round-trips are defined by the spec I'm not certain adding an extra round-trip is acceptable.*

I wasn't suggesting an extra round-trip for propagating session lifetime. I was expecting session lifetime to be added to the last SASL_AUTHENTICATE response from the broker. Because SASL is a challenge-response mechanism, the SaslServer knows when the response being sent is the last one and hence can send the session lifetime in the response (in the same way as we propagate errors). I was expecting this to be added as an extra field alongside errors, not in the opaque body as mentioned in the KIP. The opaque byte array strictly conforms to the SASL mechanism wire protocol and we want to keep it that way.

As you have said, we don't need the server to propagate session lifetime for OAUTHBEARER since the client knows the token lifetime. But the server also knows the credential lifetime, and by having the server decide the lifetime, we can use the same code path for all mechanisms. If we only have a constant max lifetime in the server, for PLAIN and SCRAM we will end up having the same lifetime for all credentials with no ability to set actual expiry. We use SCRAM for delegation token based authentication where credentials have an expiry time, so we do need to be able to set individual credential lifetimes, where the server knows the expiry time but the client may not.

*2) I also just realized that if the client is to learn the credential lifetime we wouldn't want to put special-case code in the Authenticator for GSSAPI and OAUTHBEARER; we would want to expose the value generically, probably as a negotiated property on the SaslClient instance.*

I was trying to avoid this altogether. The client doesn't need to know the credential lifetime. The server asks the client to re-authenticate within its session lifetime.

3) From the KIP, I wasn't entirely clear about the purpose of the two configs:

sasl.login.refresh.reauthenticate.enable: Do we need this? The client knows whether the broker version supports re-authentication based on the SASL_AUTHENTICATE version returned in ApiVersionsResponse. The client knows whether the broker is configured to enable re-authentication based on the session lifetime returned in SaslAuthenticateResponse. If the broker has re-authentication configured and the client supports re-authentication, you would always want to re-authenticate. So I wasn't sure why we need a config to opt in or out on the client side.

connections.max.reauth.ms: We obviously need a broker-side config. Not entirely sure about the semantics of the config that drives re-authentication. In particular, I wasn't expecting that we would reject tokens or tickets simply because they were too long-lived. Since tokens or tickets are granted by a 3rd-party authority, I am not sure if clients will always have control over the lifetime. Do we need to support any more scenarios than these:

A) reauth.ms=10, credential.lifetime.ms=10: Broker sets session.lifetime=10, so this works.
B) reauth.ms=10, credential.lifetime.ms=5: Broker sets session.lifetime=5, so this works.
C) reauth.ms=10, credential.lifetime.ms=20: Broker sets session.lifetime=10. Client re-authenticates even though the token was not refreshed. Does this matter?
D) reauth.ms=Long.MAX_VALUE, credential.lifetime.ms=10: Broker sets session.lifetime=10; client re-authenticates based on credential expiry.
E) reauth.ms=0 (default), credential.lifetime.ms=10: Broker sets session.lifetime=0; broker doesn't terminate sessions, client doesn't re-authenticate. We generate useful metrics.
F) reauth.ms=0 (default), no lifetime for credential (e.g. PLAIN): Broker sets session.lifetime=0; broker doesn't terminate sessions, client doesn't re-authenticate.
G) reauth.ms=10, no lifetime for credential (e.g. PLAIN): Broker sets session.lifetime=10. Client re-authenticates.

I would have thought that D) is the typical scenario for OAuth/Kerberos to respect token expiry time. G) would be the typical scenario for PLAIN to force re-authentication at regular intervals. A/B/C are useful to force re-authentication in scenarios where you might check for credential revocation in the server. And E/F are useful to disable re-authentication and generate metrics (also the default behaviour, useful during migration). Have I missed something?

On Mon, Sep 17, 2018 at 4:27 PM, Ron Dagostino <rndg...@gmail.com> wrote:

Hi yet again, Rajini. I also just realized that if the client is to learn the credential lifetime we wouldn't want to put special-case code in the Authenticator for GSSAPI and OAUTHBEARER; we would want to expose the value generically, probably as a negotiated property on the SaslClient instance. We might be talking about making the negotiated property key part of the public API. In other words, the SaslClient would be responsible for exposing the credential (i.e. token or ticket) lifetime at a well-known negotiated property name, such as "Credential.Lifetime", and putting a Long value there if there is a lifetime. That well-known key (e.g. "Credential.Lifetime") would be part of the public API, right?

Ron

On Mon, Sep 17, 2018 at 11:03 AM Ron Dagostino <rndg...@gmail.com> wrote:

Hi again, Rajini. After thinking about this a little while, it occurs to me that maybe the communication of max session lifetime should occur via SASL_HANDSHAKE after all. Here's why. The value communicated is the max session lifetime allowed, and the client can assume it is the session lifetime for that particular session unless the particular SASL mechanism could result in a shorter session that would be obvious to the client and the server. In particular, for OAUTHBEARER, the session lifetime will be the token lifetime, which the client and server will both know. Is there a reason to communicate this value back to the client when the client already knows it? It's an extra network round-trip, and since the SASL round-trips are defined by the spec I'm not certain adding an extra round-trip is acceptable. Even if we decide we can add it, it helps with latency if we don't.
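Rajini's scenarios A) through G) above boil down to a simple rule on the broker: the session lifetime is the smaller of the configured maximum and the credential's remaining lifetime, with 0 meaning "don't terminate". A minimal sketch of that rule (names are illustrative, not the actual implementation); for OAUTHBEARER the client could compute the same value from its own token, which is Ron's point about the extra round trip:

    // Sketch of the broker-side rule implied by scenarios A)-G) above.
    final class SessionLifetime {
        // maxReauthMs: the broker's configured maximum (0 = feature off).
        // credentialExpiryMs: absolute expiry of the token/ticket, or null if the
        // mechanism has no credential lifetime (e.g. PLAIN).
        static long sessionLifetimeMs(long maxReauthMs, Long credentialExpiryMs, long nowMs) {
            if (maxReauthMs == 0)
                return 0;                                        // E, F: no termination, metrics only
            if (credentialExpiryMs != null)                      // A, B, C, D: bounded by credential expiry
                return Math.min(maxReauthMs, credentialExpiryMs - nowMs);
            return maxReauthMs;                                  // G: no credential lifetime, use the config
        }
    }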
Kerberos may be a bit different -- I don't know if the broker can learn the session lifetime. If it can, then the same thing holds -- both client and server will know the session lifetime and there is no reason to communicate it back. If the server can't learn the lifetime, then I don't think adding an extra SASL_AUTHENTICATE round trip is going to help anyway.

Also, by communicating the max session lifetime in the SASL_HANDSHAKE response, both OAUTHBEARER and GSSAPI clients will be able to know before sending any SASL_AUTHENTICATE requests whether their credential violates the maximum. This allows a behaving client to give a good error message. A malicious client would ignore the value and send a longer-lived credential, and then that would be rejected on the server side.

I'm still good with ExpiringCredential not needing to be public.

What do you think?

Ron

On Mon, Sep 17, 2018 at 9:13 AM Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini. I've updated the KIP to reflect the decision to fully support this for all SASL mechanisms and to not require the ExpiringCredential interface to be public.

Ron

On Mon, Sep 17, 2018 at 6:55 AM Ron Dagostino <rndg...@gmail.com> wrote:

Actually, I think the metric remains the same assuming the authentication fails when the token lifetime is too long. A negative max config on the server counts what would have been killed because maybe client re-auth is not turned on; a positive max enables the kill, which is counted by a second metric as proposed. The current proposal had already stated that a non-zero value would disallow an authentication with a token that has too large a lifetime, and that is still the case, I think. But let's make sure we are on the same page about all this.

Ron

On Sep 17, 2018, at 6:42 AM, Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini. I see what you are saying. The background login refresh thread does have to factor into the decision for OAUTHBEARER because there is no new token to re-authenticate with until that refresh succeeds (similarly with Kerberos), but I think you are right that the interface doesn't necessarily have to be public. The server does decide the time for PLAIN and the SCRAM-related mechanisms under the current proposal, but it is done in SaslHandshakeResponse, and at that point the server won't have any token yet; making it happen via SaslAuthenticateRequest at the very end does allow the server to know everything for all mechanisms.

I see two potential issues to discuss. First, the server must communicate a time that exceeds the (token or ticket) refresh period on the client. Assuming it communicates the expiration time of the token/ticket, that's the best it can do. So I think that'll be fine.
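A quick worked example of why communicating the credential expiry satisfies that constraint, assuming the existing client-side login refresh behaviour (e.g. sasl.login.refresh.window.factor, which defaults to 0.8):

    // Worked example: a token issued at t0 with a 3600-second lifetime.
    final class RefreshWindowExample {
        public static void main(String[] args) {
            long t0 = 0;                        // issue time (seconds, for illustration)
            long lifetime = 3600;               // token lifetime in seconds
            double refreshWindowFactor = 0.8;   // sasl.login.refresh.window.factor default

            long refreshAt = t0 + (long) (refreshWindowFactor * lifetime);  // ~2880 s
            long sessionExpiresAt = t0 + lifetime;                          // 3600 s

            // The background login thread refreshes the credential well before a
            // session lifetime equal to the token expiry, so a fresh token should
            // normally be available by the time re-authentication is needed.
            System.out.println(refreshAt + " < " + sessionExpiresAt);
        }
    }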
The second issue is what happens if the server is configured to accept a max value -- say, an hour -- and the token is good for longer. I assumed that the client should not be allowed to authenticate in the first place, because it could then simply re-authenticate with the same token after an hour, which defeats the motivations for this KIP. So do we agree the max token lifetime allowed will be enforced at authentication time? Assuming so, then we need to discuss migration.

The current proposal includes a metric that can be used to identify if an OAUTHBEARER client is misconfigured -- the number of connections that would have been killed (could be non-zero when the configured max is a negative number). Do we still want this type of indication, and if so, is it still done the same way -- a negative number for max, but instead of counting the number of kills that would have been done, it counts the number of authentications that would have been failed?

Ron

On Sep 17, 2018, at 6:06 AM, Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

Thinking a bit more about other SASL mechanisms, I think the issue is that in the current proposal, clients decide the re-authentication period. For mechanisms like PLAIN or SCRAM, we would actually want the server to determine the re-authentication interval. For example, the PLAIN or SCRAM database could specify the expiry time for each principal, or the broker could be configured with a single expiry time for all principals of that mechanism. For OAuth, brokers do know the expiry time. I haven't figured out what to do with Kerberos, but in any case, for broker-side connection termination on expiry, we need the broker to know/decide the expiry time. So I would like to suggest a slightly different approach to managing re-authentications.

1) Instead of changing the SASL_HANDSHAKE version number, we bump up the SASL_AUTHENTICATE version number.
2) In the final SASL_AUTHENTICATE response, the broker returns the expiry time of this authenticated session. This is the interval after which the broker will terminate the connection if it wasn't re-authenticated.
3) We no longer need the public interface change to add `org.apache.kafka.common.security.expiring.ExpiringCredential` for the client side. Instead, we schedule the next re-authentication on the client based on the expiry time provided by the server (some time earlier than the expiry).
4) If the client uses SASL_AUTHENTICATE v0, the broker will not return the expiry time. The connection will be terminated if that feature is enabled (the same code path as a client failing to re-authenticate on time).

Thoughts?
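A minimal sketch of what point 3) implies on the client side: once the final SaslAuthenticateResponse carries a session lifetime, the client schedules re-authentication some time before that lifetime elapses. The class name and the 0.9 factor are illustrative assumptions, not part of the proposal:

    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch only: sessionLifetimeMs would come from the proposed field in the
    // final SaslAuthenticateResponse; 0 means the broker is not enforcing expiry.
    final class ReauthScheduler {
        static void schedule(ScheduledExecutorService scheduler,
                             long sessionLifetimeMs,
                             Runnable reauthenticate) {
            if (sessionLifetimeMs <= 0)
                return;                                       // broker not enforcing expiry
            long delayMs = (long) (sessionLifetimeMs * 0.9);  // re-authenticate a bit early
            scheduler.schedule(reauthenticate, delayMs, TimeUnit.MILLISECONDS);
        }
    }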
On Sat, Sep 15, 2018 at 3:06 AM, Ron Dagostino <rndg...@gmail.com> wrote:

Hi everyone. I've updated the KIP to reflect all discussion to date, including the decision to go with the low-level approach. This latest version of the KIP includes the above "connections.max.reauth.ms" proposal, which I know has not been discussed yet. It mentions new metrics, some of which may not have been discussed either (and names are missing from some of them). Regardless, this new version is the closest yet to a version that can be put to a vote next week.

Ron

On Fri, Sep 14, 2018 at 8:59 PM Ron Dagostino <rndg...@gmail.com> wrote:

Minor correction: I'm proposing "connections.max.reauth.ms" as the broker-side configuration property, not "connections.max.expired.credentials.ms".

Ron

On Fri, Sep 14, 2018 at 8:40 PM Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini. I'm going to choose *connections.max.expired.credentials.ms* as the option name because it is consistent with the comment I made in the section about how to get the client and server to agree on credential lifetime:

"We could add a new API call so that clients could ask servers for the lifetime they use, or we could extend the SaslHandshakeRequest/Response API call to include that information in the server's response – the client would then adopt that value"

We set the config option value on the broker (with the "listener.name.mechanism." prefix), and we will return the configured value in the SaslHandshakeResponse (requiring a wire format change in addition to a version bump). The value (referred to as "X" below) can be negative, zero, or positive and is to be interpreted as follows:

X = 0: Fully disable. The server will respond to re-authentications, but it won't kill any connections due to expiration, and it won't track either of the 2 metrics mentioned below.

Now, a couple of definitions for when X != 0:

1) We define the *maximum allowed expiration time* to be the time determined by the point when (re-)authentication occurs plus |X| milliseconds.
2) We define the *requested expiration time* to be the maximum allowed expiration time, except for the case of an OAuth bearer token, in which case it is the point at which the token expires. (Kerberos presumably also specifies a ticket lifetime, but frankly I am not in a position to do any Kerberos-related coding in time for a 2.1.0 release and would prefer to ignore this piece of information for this KIP -- would it be acceptable to have someone else add it later?)

Based on these definitions, we define the behavior as follows:

*X > 0: Fully enable*
A) The server will reject any authentication/re-authentication attempt when the requested expiration time is after the maximum allowed expiration time (which could only happen with an OAuth bearer token, assuming we skip dealing with Kerberos for now).
B) The server will kill connections that are used beyond the requested expiration time.
C) A broker metric will be maintained that documents the number of connections killed by the broker. This metric will be non-zero if a client is connecting to the broker with re-authentication either unavailable (i.e. an older client) or disabled.

*X < 0: Partially enable*
A) Same as above: the server will reject any authentication/re-authentication attempt when the requested expiration time is after the maximum allowed expiration time (which could only happen with an OAuth bearer token, assuming we skip dealing with Kerberos for now).
B) The server will **not** kill connections that are used beyond the requested expiration time.
C) A broker metric will be maintained that documents the number of API requests unrelated to re-authentication that are made over a connection beyond the requested expiration time. This metric will be helpful for operators to ensure that all clients are properly upgraded and re-authenticating before fully enabling the server-side expired-connection-kill functionality (i.e. by changing the value from a negative number to a positive number): this metric will be zero across all brokers when it is safe to fully enable the server-side feature.

Thoughts?

Ron

On Fri, Sep 14, 2018 at 11:59 AM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

It will be good to shorten the config name. We have a config named `connections.max.idle.ms`. We could add something like `connections.max.expired.credentials.ms`. As you said, it would be prefixed with the listener and SASL mechanism name. We should be able to support connection termination in the future even with SSL, so perhaps we don't want the `sasl.server.` prefix (we just need to validate whether the config is supported for the protocol). You can choose whether a boolean flag or a time interval makes more sense.

For the client side, the KIP explains how to make it work for other mechanisms and we can leave it at that. Not convinced about the server side, though. It seems to me that we probably would require API changes to make it work. Basically the proposed approach works only for OAUTHBEARER. Since we have to bump up the SaslHandshakeRequest version in this KIP, it will be good to work out if we need to change the request or flow to make this work for other mechanisms. I haven't figured out exactly what is needed, but will think about it and get back next week. In the meantime, you can get the KIP up-to-date with the config, migration path etc. and get it ready to start a vote next week.

On Fri, Sep 14, 2018 at 1:24 PM, Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini.

Agreed, we will bump the SaslHandshakeRequest API number (no wire format change, of course).

Agreed, responding to re-authentication will always be enabled on the broker since it is initiated by the client.

Agreed, connection termination by the broker will be opt-in. I'm thinking *sasl.server.disconnect.expired.credential.connections.enable*, prefixed with the listener and SASL mechanism name in lower case so it can be opt-in at the granularity of a SASL mechanism; for example, "listener.name.sasl_ssl.oauthbearer.sasl.server.disconnect.expired.credential.connections.enable=true". It is long, so if you have a preference for a shorter name let me know and we'll go with it instead.

Agreed, additional metrics will be helpful. I'll add them to the KIP.

<<< need to work out exactly how this works with all SASL mechanisms
<<< not sure we have the data required on the broker or a way of extending the [GSSAPI] mechanism
Not sure I agree here. The paragraph I added to the KIP describes how I think this can be done. Given that description, do you still feel Kerberos will not be possible?
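One possible answer to the Kerberos question, sketched here only as an assumption about how it could be done (this is not something the KIP specifies): once the GSSAPI exchange completes, the broker's established GSSContext exposes a remaining context lifetime, which could feed the same expiry calculation used for tokens.

    import org.ietf.jgss.GSSContext;
    import org.ietf.jgss.GSSException;

    // Sketch: derive a credential expiry for an established GSSAPI context on the
    // broker. Returns null if the context reports an indefinite lifetime.
    final class KerberosExpiry {
        static Long expiryMs(GSSContext context, long nowMs) throws GSSException {
            int remainingSecs = context.getLifetime();    // remaining lifetime in seconds
            if (remainingSecs == GSSContext.INDEFINITE_LIFETIME)
                return null;
            return nowMs + remainingSecs * 1000L;
        }
    }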
If Kerberos presents a significant hurdle, then I don't think that should prevent us from doing it for other mechanisms -- I would rather state that it isn't supported with GSSAPI due to limitations in Kerberos itself than not have it for OAUTHBEARER, for example. And right now I don't think we need more than a high-level description of how this could be done. In other words, I think we have this covered, unless there is a fundamental problem due to Kerberos not making data (i.e. ticket expiration) available on the server. I want to submit this KIP for a vote early next week in the hope of having it accepted by the Monday, 9/24 deadline for the 2.1.0 release, and if that happens I believe I can get a PR into really good shape soon thereafter (I'm working on it now).

Ron

On Fri, Sep 14, 2018 at 5:57 AM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

1) Yes, we should update the version of `SaslHandshakeRequest`. Clients would attempt to re-authenticate only if the broker's SaslHandshakeRequest version >= 2.
2) I think we should enable the broker's re-authentication and connection termination code regardless of client version. Re-authentication could be always enabled since it is triggered only by clients and the broker could just handle it. Connection termination should be opt-in, but should work with clients of any version. If turned on and clients are of an older version, this would simply force reconnection, which should be ok. Additional metrics to monitor expired connections may be useful anyway.

We also need to work out exactly how this works with all SASL mechanisms. In particular, we have to ensure that we can make it work with Kerberos, since we use the implementation from the JRE and I am not sure we have the data required on the broker or a way of extending the mechanism.

Thoughts?

On Thu, Sep 13, 2018 at 6:56 PM, Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini. I'm thinking about how we deal with migration. For example, let's say we have an existing 2.0.0 cluster using SASL/OAUTHBEARER and we want to use this feature. The desired end state is to have all clients and brokers migrated to a version that supports the feature (call it 2.x) with the feature turned on. We need to document the supported path(s) to this end state.

Here are a couple of scenarios with implications:

1) *Migrate the client to 2.x and turn the feature on, but the broker is still at 2.0.* In this case the client is going to get an error when it sends the SaslHandshakeRequest. We want to avoid this. It seems to me the client needs to know if the broker has been upgraded to the necessary version before trying to re-authenticate, which makes me believe we need to bump the API version number in 2.x, and the client has to check whether the max version the broker supports meets a minimum standard before trying to re-authenticate. Do you agree?

2) *Migrate the broker to 2.x and turn the feature on, but the client is still at 2.0.* In this case the broker is going to end up killing connections periodically. Again, we want to avoid this. It's tempting to say "don't do this", but I wonder if we can require installations to upgrade all clients before turning the feature on at the brokers. Certainly we can say "don't do this" for inter-broker clients -- in other words, we can say that all brokers in a cluster should be upgraded before the feature is turned on for any one of them -- but I don't know about our ability to say it for non-broker clients.

Now we consider the cases where both the brokers and the clients have been upgraded to 2.x. When and where should the feature be turned on? The inter-broker case is interesting because I don't think we can require installations to restart every broker with a new config where the feature is turned on at the same time. Do you agree that we cannot require this and therefore must support installations restarting brokers with a new config at whatever pace they need -- which may be quite slow relative to token lifetimes? Assuming you do agree, then there is going to be the case where some brokers are going to have the feature turned on and some clients (definitely inter-broker clients at a minimum) are going to have the feature turned off.

The above discussion assumes a single config on the broker side that turns on both the inter-broker client re-authentication feature as well as the server-side expired-connection-kill feature, but now I'm thinking we need the ability to turn these features on independently, plus maybe we need a way to monitor the number of "active, expired" connections to the server so that operators can be sure that all clients have been upgraded/enabled before turning on the server-side expired-connection-kill feature.

So the migration plan would be as follows:

1) Upgrade all brokers to 2.x.
2) After all brokers are upgraded, turn on re-authentication for all brokers at whatever rate is desired -- just eventually, at some point, get the client-side feature turned on for all brokers so that inter-broker connections are re-authenticating.
3) In parallel with (1) and (2) above, upgrade clients to 2.x and turn their re-authentication feature on. Clients will check the API version and only re-authenticate to a broker that has also been upgraded (note that the ability of a broker to respond to a re-authentication cannot be turned off -- it is always on beginning with version 2.x, and it just sits there doing nothing if it isn't exercised by an enabled client).
4) After (1), (2), and (3) are complete, check the 2.x broker metrics to confirm that there are no "active, expired" connections. Once you are satisfied that (1), (2), and (3) are indeed complete, you can enable the server-side expired-connection-kill feature on each broker via a restart at whatever pace is desired.

Comments?

Ron

On Wed, Sep 12, 2018 at 4:48 PM Ron Dagostino <rndg...@gmail.com> wrote:

Ok, I am tempted to just say we go with the low-level approach since it is the quickest and seems to meet the clear requirements. We can always adjust later if we get to clarity on other requirements or we decide we need to approach it differently for whatever reason. But in the meantime, before fully committing to this decision, I would appreciate another perspective if someone has one.

Ron

On Wed, Sep 12, 2018 at 3:15 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

Yes, I would leave out retries from this KIP for now. In the future, if there is a requirement for supporting retries, we can consider it. I think we can support retries with either approach if we needed to, but it would be better to do it along with other changes required to support authentication servers that are not highly available.

For maintainability, I am biased, so it will be good to get the perspective of others in the community :-)

On Wed, Sep 12, 2018 at 7:47 PM, Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini. Here is some feedback/some things I thought of.

First, I just realized that from a timing perspective I am not sure retry is going to be necessary. The background login refresh thread triggers re-authentication when it refreshes the client's credential. The OAuth infrastructure has to be available in order for this refresh to succeed (the background thread repeatedly retries if it can't refresh the credential, and that retry loop handles any temporary outage). If clients are told to re-authenticate when the credential is refreshed **and they actually re-authenticate at that point**, then it is highly unlikely that the OAuth infrastructure would fail within those intervening milliseconds. So we don't need re-authentication retry in this KIP as long as retry starts immediately. The low-level prototype as currently coded doesn't actually start re-authentication until the connection is subsequently used, and that could take a while. But then again, if the connection isn't used for some period of time, then the lost value of having to abandon the connection is lessened anyway. Plus, as you pointed out, OAuth infrastructure is assumed to be highly available.

Does this make sense, and would you be willing to say that retry isn't a necessary requirement? I'm tempted, but would like to hear your perspective first.

Interleaving/latency and maintainability (more than lines of code) are the two remaining areas of comparison. I did not add the maintainability issue to the KIP yet, but before I do, I thought I would address it here first to see if we can come to consensus on it, because I'm not sure I see the high-level approach as being hard to maintain (yet -- I could be convinced/convince myself; we'll see). I do want to make sure we are on the same page about what is required to add re-authentication support in the high-level case. Granted, the amount is zero in the low-level case, and it doesn't get any better than that, but the amount in the high-level case is very low -- just a few lines of code. For example:

KafkaAdminClient:
https://github.com/apache/kafka/pull/5582/commits/4fa70f38b9d33428ff98b64a3a2bd668f5f28c38#diff-6869b8fccf6b098cbcb0676e8ceb26a7
It is the same few lines of code for KafkaConsumer, KafkaProducer, WorkerGroupMember, and TransactionMarkerChannelManager.

The two synchronous I/O use cases are ControllerChannelManager and ReplicaFetcherBlockingSend (via ReplicaFetcherThread), and they require a little bit more -- but not much.

Thoughts?

Ron

On Wed, Sep 12, 2018 at 1:57 PM Ron Dagostino <rndg...@gmail.com> wrote:

Thanks, Rajini. Before I digest/respond to that, here's an update that I just completed.

I added a commit to the PR (https://github.com/apache/kafka/pull/5582/) that implements server-side kill of expired OAUTHBEARER connections. No tests yet since we still haven't settled on a final approach (low-level vs. high-level).
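For readers following along, the server-side kill that the commit above implements amounts to a check along these lines on the broker's request path. This is a rough sketch of the idea only; the real change lives in the broker's channel/selector code and does not look like this.

    // Sketch: the decision a broker could make when a request arrives on a session
    // whose credential has expired. sessionExpiryMs is recorded at (re-)authentication
    // time; 0 means expiry enforcement is disabled for this connection.
    final class ExpiredSessionCheck {
        static boolean shouldCloseConnection(long sessionExpiryMs, long nowMs,
                                             boolean isReauthenticationRequest) {
            if (sessionExpiryMs == 0)
                return false;                 // feature disabled
            if (isReauthenticationRequest)
                return false;                 // let the client re-authenticate
            return nowMs > sessionExpiryMs;   // otherwise an expired session is killed
        }
    }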
I also updated the KIP to reflect the latest discussion and PR as follows:

- Include support for brokers killing connections as part of this KIP (rather than deferring it to a future KIP as was originally mentioned; the PR now includes it as mentioned -- it was very easy to add)
- Added metrics (they will mirror existing ones related to authentications; I have not added those to the PR)
- Updated the implementation description to reflect the current state of the PR, which is a high-level, one-size-fits-all approach (as opposed to my initial, even-higher-level approach)
- Added a "Rejected Alternative" for the first version of the PR, which injected requests directly into synchronous I/O clients' queues
- Added a "Rejected Alternative" for the low-level approach as suggested, but of course we have not formally decided to reject this approach or adopt the current PR implementation.

I'll think about where we stand some more before responding again. Thanks for the above reply.

Ron

On Wed, Sep 12, 2018 at 1:36 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

Thank you for summarising, I think it covers the differences between the two approaches well.

A few minor points to answer the questions in there:

1) When re-authentication is initiated in the Selector during poll(), we can move an idle channel to re-authentication state. It is similar to injecting requests, but achieved by changing the channel back to authenticating state.

3) To clarify why I think re-authentication should fit in with our authentication design: my point was not about a specific connection being usable or not usable. It was about what happens at the client API level. Our client API (producer/consumer/admin client etc.) currently assumes that a single broker authentication failure is a fatal error that is never retried, because we assume that a broker only ever fails an authentication request if credentials are invalid. If we ever decide to support cases where a broker occasionally fails an authentication request due to a transient failure, we need to do more around how we handle authentication failures in clients. We may decide that it is ok to close the connection for authentication and not for re-authentication, as you mentioned, but we need to change the way this disconnection is handled by clients. So IMO, we should either add support for transient retriable authentication failures properly or not retry for any scenario. Personally, I don't think we would want to retry all authentication failures even if it is a re-authentication. I think we could (at some point in future) allow brokers to return an error code that indicates that it is a transient broker-side failure rather than invalid credentials, and handle the error differently. I see no reason at that point why we wouldn't handle authentication and re-authentication in the same way.

4) As you said, the high-level approach would be bigger than the low-level approach in terms of LOC. But I wouldn't be too concerned about lines of code. My bigger concern was about modularity. Our security code is already complex; protocols like Kerberos and SSL that we use from the JRE make problem diagnosis hard. Async I/O makes the networking code complex. You need to understand the networking layer to work with the security layer, but the rest of the code base doesn't rely on knowledge of the network/security layers. My main concern about the high-level approach is that it spans these boundaries, making it harder to maintain in the long run.

On Wed, Sep 12, 2018 at 10:23 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:

Hi Ron, Rajini

Thanks for summarizing the discussion so far, Ron!

1) How often do we have such long-lived connection idleness (e.g. 5-10 minutes) in practice?

3) I agree that retries for re-authentication are useful.

4) The interleaving of requests sounds like a great feature to have, but the tradeoff against code complexity is a tough one. I would personally go with the simpler approach, since you could always add interleaving on top if the community decides the latency should be better.

Best,
Stanislav

On Tue, Sep 11, 2018 at 5:00 AM Ron Dagostino <rndg...@gmail.com> wrote:

Hi everyone. I've updated the PR to reflect the latest conclusions from this ongoing discussion. The KIP still needs the suggested updates; I will do that later this week. I agree with Rajini that some additional feedback from the community at large would be very helpful at this point in time.

Here's where we stand.

We have 2 separate implementations for re-authenticating existing connections -- a so-called "low-level" approach and a (relatively speaking) "high-level" approach -- and we agree on how they should be compared. Specifically, Rajini has provided a mechanism that works at a relatively low level in the stack by intercepting network sends and queueing them up while re-authentication happens; then the queued sends are sent after re-authentication succeeds, or the connection is closed if re-authentication fails. See the prototype commit at https://github.com/rajinisivaram/kafka/commit/b9d711907ad843c11d17e80d6743bfb1d4e3f3fd.

I have an implementation that works higher up in the stack; it injects authentication requests into the KafkaClient via a new method added to that interface, and the implementation (i.e. NetworkClient) sends them when poll() is called. The updated PR is available at https://github.com/apache/kafka/pull/5582/.

Here is how we think these two approaches should be compared.

1) *When re-authentication starts*. The low-level approach initiates re-authentication only if/when the connection is actually used, so it may start after the existing credential expires; the current PR implementation injects re-authentication requests into the existing flow, and re-authentication starts immediately regardless of whether or not the connection is being used for something else. Rajini believes the