Hi yet again, Rajini.  If we decide that interleaving is impossible with
the low-level approach but we still want to at least support the
possibility given that it reduces the size of latency spikes, then I do
have a suggestion.  I documented in the KIP how I rejected a single,
one-size-fits-all queue approach because we don't know when poll() will be
called in the synchronous I/O use case.  An advantage of such an approach
-- if it would have worked, which it didn't -- is that it would have been
transparent to the owner of the NetworkClient instance (this is one of the
big advantages of the low-level approach, after all).  But we could make
the one-size-fits-all queue approach work if we are willing to impose some
requirements on synchronous I/O users of NetworkClient.  Specifically, we
could add a method to the org.apache.kafka.clients.KafkaClient interface
(which NetworkClient implements) as follows:

    /**
     * Return true if any node has re-authentication requests waiting to be sent. A
     * call to {@link #poll(long, long)} is required to send such requests. An owner
     * of this instance that does not implement a run loop to repeatedly call
     * {@link #poll(long, long)} but instead only sends requests synchronously
     * on-demand must call this method periodically -- and invoke
     * {@link #poll(long, long)} if the return value is {@code true} -- to ensure
     * that any re-authentication requests that have been injected are sent in a
     * timely fashion.
     *
     * @return true if any node has re-authentication requests waiting to be sent
     */
    default boolean hasReauthenticationRequests() {
        return false;
    }

If we are willing to add an additional method like this and make the minor
adjustments to the code associated with the few synchronous use cases (I
sketch such an adjustment below), then I think the high-level approach will
work.  We would then have the possibility of further parameterizing the
various synchronous I/O use cases.  For example, there are currently 3:

ControllerChannelManager
KafkaProducer, via Sender
ReplicaFetcherBlockingSend, via ReplicaFetcherThread

Is it conceivable that one of these use cases might be more
latency-sensitive than others and might desire interleaving?  A high-level
approach would give us the option of configuring each use case accordingly.
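
To make the "minor adjustments" concrete, here is a rough sketch of what each
synchronous owner might do before issuing its own blocking request.  This is
not code from the PR, and the class and method names are hypothetical -- it
just illustrates the shape of the change:

    import org.apache.kafka.clients.KafkaClient;
    import org.apache.kafka.common.utils.Time;

    // Hypothetical helper that a synchronous owner (ControllerChannelManager,
    // Sender, or ReplicaFetcherBlockingSend) could invoke at the top of its
    // send path, before issuing its own blocking request.
    public class ReauthenticationPoller {

        private final KafkaClient client; // the NetworkClient the owner holds exclusively
        private final Time time;

        public ReauthenticationPoller(KafkaClient client, Time time) {
            this.client = client;
            this.time = time;
        }

        // If re-authentication requests have been injected, give the client a
        // chance to send them via poll(); otherwise return immediately.  The
        // responses would be handled by the callbacks attached when the requests
        // were injected, so the owner can ignore poll()'s return value here.
        public void maybeSendReauthenticationRequests(long pollTimeoutMs) {
            if (client.hasReauthenticationRequests())
                client.poll(pollTimeoutMs, time.milliseconds());
        }
    }

The per-use-case parameterization could then be as simple as choosing the poll
timeout (or skipping the check entirely) for a given owner.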

Ron

On Fri, Sep 7, 2018 at 10:50 PM Ron Dagostino <rndg...@gmail.com> wrote:

> Hi again, Rajini.  It occurs to me that from a *behavior* perspective
> there are really 3 fundamental differences between the low-level approach
> you provided and the high-level approach as it currently exists in the PR:
>
> 1) *When re-authentication starts*.  The low-level approach initiates
> re-authentication only if/when the connection is actually used, so it may
> start after the existing credential expires; the current PR implementation
> injects re-authentication requests into the existing flow, and
> re-authentication starts immediately regardless of whether or not the
> connection is being used for something else.
>
> 2) *When requests not related to re-authentication can use the
> re-authenticating connection*.  The low-level approach finishes
> re-authentication completely before allowing anything else to traverse the
> connection; the current PR implementation interleaves re-authentication
> requests with existing flow requests.
>
> 3) *What happens if re-authentication fails*.  The low-level approach
> results in a closed connection and does not support retries -- at least as
> currently implemented; the current PR implementation supports retries.
>
> Do you agree that these are the (only?) behavioral differences?
>
> For these facets of behavior, I wonder what the requirements are for this
> feature.  I think they are as follows:
>
> 1) *When re-authentication starts*: I don't think we have a hard
> requirement here when considered in isolation -- whether re-authentication
> starts immediately or is delayed until the connection is used probably
> doesn't matter.
>
> 2) *When requests not related to re-authentication can use the
> re-authenticating connection*: there is a tradeoff here between latency
> and ability to debug re-authentication problems.  Delaying use of the
> connection until re-authentication finishes results in bigger latency
> spikes but keeps re-authentication requests somewhat together; interleaving
> minimizes the size of individual latency spikes but adds some separation
> between the requests.
>
> 3) *What happens if re-authentication fails*: I believe we have a clear
> requirement for retry capability given what I have written previously.  Do
> you agree?  Note that while the current low-level implementation does not
> support retry, I have been thinking about how that can be done, and I am
> pretty sure it can be.  We can keep the old authenticators on the client
> and server sides and put them back into place if the re-authentication
> fails; we would also have to make sure the server side doesn't delay any
> failed re-authentication result and also doesn't close the connection upon
> re-authentication failure.  I think I see how to do all of that.
>
> There are some interactions between the above requirements.  If
> re-authentication can't start immediately and has to wait for the
> connection to be used then that precludes interleaving because we can't be
> sure that the credential will be active by the time it is used -- if it
> isn't active, and we allow interleaving, then requests not related to
> re-authentication will fail if server-side
> connection-close-due-to-expired-credential functionality is in place.  Also
> note that any such server-side connection-close-due-to-expired-credential
> functionality would likely have to avoid closing a connection until it is
> used for anything other than re-authentication -- it must allow
> re-authentication requests through when the credential is expired.
>
> Given all of the above, it feels to me like the low-level solution is
> viable only under the following conditions:
>
> 1) We must accept that re-authentication is delayed until the connection is
> actually used
> 2) We must accept the bigger latency spikes associated with not
> interleaving requests
>
> Does this sound right to you, and if so, do you find these conditions
> acceptable?  Or have I missed something and/or made incorrect assumptions
> somewhere?
>
> Ron
>
>
> On Fri, Sep 7, 2018 at 5:19 PM Ron Dagostino <rndg...@gmail.com> wrote:
>
>> Hi Rajini.  The code really helped me to understand what you were
>> thinking -- thanks.  I'm still digesting, but here are some quick
>> observations:
>>
>> Your approach (I'll call it the "low-level" approach, as compared to the
>> existing PR, which works at a higher level of abstraction) is certainly
>> intriguing.  The smaller code size is welcome, of
>> course, as is the fact that re-authentication simply works for everyone
>> regardless of the style of use (async vs. sync I/O).
>>
>> I did notice that re-authentication of the connection starts only if/when
>> the client uses the connection.  For example, if I run a console producer,
>> re-authentication does not happen unless I try to produce something.  On
>> the one hand this is potentially good -- if the client isn't using the
>> connection then the connection stays "silent" and could be closed on the
>> server side if it stays idle long enough -- whereas if we start injecting
>> re-authentication requests (as is done in the high-level approach) then the
>> connection never goes completely silent and could (?) potentially avoid
>> being closed due to idleness.
>>
>> However, if we implement server-side kill of connections using expired
>> credentials (which we agree is needed), then we might end up with the
>> broker killing connections that are sitting idle for only a short period of
>> time.  For example, if we refresh the token on the client side and tell the
>> connection that it is eligible to be re-authenticated, then it is
>> conceivable that the connection might be sitting idle at that point and
>> might not be used until after the token it is currently using expires.  The
>> server might kill the connection, and that would force the client to
>> re-connect with a new connection (requiring TLS negotiation). The
>> probability of this happening increases as the token lifetime decreases, of
>> course, and it can be offset by decreasing the window factor (i.e. make it
>> eligible for re-authentication at 50% of the lifetime rather than 80%, for
>> example -- it would have to sit idle for longer before the server tried to
>> kill it).  We haven't implemented server-side kill yet, so maybe we could
>> make it intelligent and only kill the connection if it is used (for
>> anything except re-authentication) after the expiration time...
>>
>> I also wonder about the ability to add retry into the low-level
>> approach.  Do you think it would be possible?  It doesn't seem like it to
>> me -- at least not without some big changes that would eliminate some of
>> the advantage of being able to reuse the existing authentication code.  The
>> reason I ask is because I think retry is necessary.  It is part of how
>> refreshes work for both GSSAPI and OAUTHBEARER -- they refresh based on
>> some window factor (i.e. 80%) and implement periodic retry upon failure so
>> that they can maximize the chances of having a new credential available for
>> any new connection attempt.  Without retry we could end up in the
>> situation where the connection still has some lifetime left (10%, 20%, or
>> whatever) but it tries to re-authenticate and cannot through no fault of
>> its own (i.e. token endpoint down, some Kerberos failure, etc.) -- the
>> connection is closed at that point, and it is then unable to reconnect
>> because of the same temporary problem.  We could end up with an especially
>> ill-timed, temporary outage in some non-Kafka system (related to OAuth or
>> Kerberos, or some LDAP directory) causing all clients to be kicked off the
>> cluster.  Retry capability seems to me to be the way to mitigate this risk.
>>
>> Anyway, that's it for now.  I really like the approach you outlined -- at
>> least at this point based on my current understanding.  I will continue to
>> dig in, and I may send more comments/questions.  But for now, I think the
>> lack of retry -- and my definitely-could-be-wrong sense that it can't
>> easily be added -- is my biggest concern with this low-level approach.
>>
>> Ron
>>
>> On Thu, Sep 6, 2018 at 4:57 PM Rajini Sivaram <rajinisiva...@gmail.com>
>> wrote:
>>
>>> Hi Ron,
>>>
>>> The commit
>>>
>>> https://github.com/rajinisivaram/kafka/commit/b9d711907ad843c11d17e80d6743bfb1d4e3f3fd
>>> shows
>>> the kind of flow I was thinking of. It is just a prototype with a fixed
>>> re-authentication period to explore the possibility of implementing
>>> re-authentication similar to authentication. There will be edge cases to
>>> cover and errors to handle, but hopefully the code makes the approach
>>> clearer than my earlier explanation!
>>>
>>> So the differences in the two implementations are as you have already
>>> mentioned earlier:
>>>
>>>    1. Re-authentication sequences are not interleaved with Kafka
>>> requests.
>>>    As you said, this has a higher impact on latency. IMO, this also
>>> makes it
>>>    easier to debug, especially with complex protocols like Kerberos
>>> which are
>>>    notoriously difficult to diagnose.
>>>    2. Re-authentication failures will not be retried, they will be
>>> treated
>>>    as fatal errors similar to authentication failures. IMO, since we
>>> rely on
>>>    brokers never rejecting valid authentication requests (clients treat
>>>    authentication failures as fatal errors), it is ok to fail on
>>>    re-authentication failure as well.
>>>    3. Re-authentication is performed on the network thread on brokers
>>>    similar to authentication (in your implementation, I think it would
>>> be on
>>>    the request thread). IMO, it is better to do all authentications using
>>> the
>>>    same thread pool.
>>>    4. Code complexity: I don't think actual code complexity would be very
>>>    different between the two implementations. But I think there is value
>>> in
>>>    keeping re-authentication code within existing network/security
>>> layers. The
>>>    number of classes modified will be smaller and the number of packages
>>>    modified even smaller.
>>>
>>> Anyway, let me know what you think:
>>>
>>>    - Will this approach work for your scenarios?
>>>    - Do we really need to handle re-authentication differently from
>>>    authentication?
>>>
>>>
>>> On Thu, Sep 6, 2018 at 1:40 PM, Ron Dagostino <rndg...@gmail.com> wrote:
>>>
>>> > Yeah, regarding ControllerChannelManager, it is one of the synchronous
>>> I/O
>>> > use cases (along with 2 others: KafkaProducer, via Sender; and
>>> > ReplicaFetcherBlockingSend, via ReplicaFetcherThread) where the
>>> assumption
>>> > is complete ownership of the connection. The PR's approach to dealing
>>> with
>>> > that assumption is to inject the re-authentication requests into the
>>> > owner's existing flow so that they are simply sent with a callback that
>>> > NetworkClient executes.  The alternative approach, which is what I
>>> think
>>> > you are investigating, is to allow the connection to be temporarily
>>> taken
>>> > offline and to have any attempt by ControllerChannelManager (or other
>>> > synchronous use case owner) to use the connection while it is in this
>>> > "offline" state result in some kind of pause until the connection comes
>>> > back online.  One issue with this approach might be the length of time
>>> that
>>> > the connection is unavailable; will it be offline for all
>>> authentication
>>> > requests and responses (ApiVersionsRequest/Response,
>>> > SaslHandshakeRequest/Response, and SaslAuthenticateRequest/Response)?
>>> > Note
>>> > the last one could actually be invoked multiple times, so there could
>>> be 4
>>> > or more round-trips before the authentication "dance" is finished.
>>> Will
>>> > the connection be "offline" the entire time, or will it be placed back
>>> > "online" in between each request/response pair to allow the owner of
>>> the
>>> > connection to use it -- in which case the authentication process would
>>> have
>>> > to wait to get ownership again?  The approach I took interleaves the
>>> > authentication requests/responses with whatever the owner is doing, so
>>> it
>>> > is conceivable that use of the connection jumps back and forth between
>>> the
>>> > two purposes.  Such jumping back and forth minimizes any added latency
>>> due
>>> > to the re-authentication, of course.
>>> >
>>> > Anyway, I'll look forward to finding out what you are able to conclude.
>>> > Good luck :-)
>>> >
>>> > Ron
>>> >
>>> > On Thu, Sep 6, 2018 at 5:17 AM Rajini Sivaram <rajinisiva...@gmail.com
>>> >
>>> > wrote:
>>> >
>>> > > Hi Ron,
>>> > >
>>> > > Disconnections on the broker-side: I think we should do
>>> disconnections
>>> > as a
>>> > > separate KIP and PR as you originally intended. But that one could be
>>> > done
>>> > > separately without requiring KIP-368 as a pre-req. As a simpler
>>> > > implementation and one that can be used without KIP-368 in some
>>> cases, we
>>> > > could commit that first since this one may take longer.
>>> > >
>>> > > I wasn't suggesting that we move re-authentication to NetworkClient;
>>> > > I was thinking of the lower layers, handling authentication and
>>> > > re-authentication at the same layer in a similar way. Let me look into
>>> > > the code and come up with a more detailed explanation to avoid confusion.
>>> > >
>>> > > I am not too concerned about the imports in KafkaChannel. As you
>>> said, we
>>> > > can improve that if we need to. KafkaChannels are aware of
>>> > > network/authentication states and if that becomes a bit more
>>> complex, I
>>> > > don't think it would matter so much. My concern is about changes like
>>> > >
>>> > > https://github.com/apache/kafka/pull/5582/files#diff-987fef43991384a3ebec5fb55e53b577
>>> > > in ControllerChannelManager. Those classes shouldn't have to deal with
>>> > > SASL or
>>> > > reauthentication. Anyway, I will get back with more detail on what I
>>> had
>>> > in
>>> > > mind since that may not be viable at all.
>>> > >
>>> > >
>>> > >
>>> > > On Thu, Sep 6, 2018 at 1:44 AM, Ron Dagostino <rndg...@gmail.com>
>>> wrote:
>>> > >
>>> > > > I just thought of another alternative if the imports are the
>>> concern.
>>> > > > KafkaChannel could expose the fact that it can create an additional
>>> > > > Authenticator instance on the side (what I referred to as
>>> > > > notYetAuthenticatedAuthenticator in the PR) and it could let
>>> > > > kafka.server.KafkaApis drive the whole thing -- create the
>>> instance on
>>> > > the
>>> > > > side, clean it up if it fails, move it into place and close the
>>> old one
>>> > > if
>>> > > > it succeeds, etc.  Then KafkaChannel wouldn't need to import
>>> anything
>>> > new
>>> > > > -- it just exposes its Authenticator and the ability to perform the
>>> > swap
>>> > > > upon success, etc.
>>> > > >
>>> > > > Ron
>>> > > >
>>> > > > On Wed, Sep 5, 2018 at 5:01 PM Ron Dagostino <rndg...@gmail.com>
>>> > wrote:
>>> > > >
>>> > > > > Hi again, Rajini,  I realized a couple of potential concerns with
>>> > using
>>> > > > > the TransportLayer directly during re-authentication.   First,
>>> in the
>>> > > > > blocking I/O use case, the owner of the NetworkClient instance
>>> calls
>>> > > > > NetworkClientUtils.sendAndReceive() to send requests.   This
>>> method
>>> > > > > assumes the caller has exclusive access to the NetworkClient, so
>>> it
>>> > > does
>>> > > > > not check to see if the node is ready; it just sends the request
>>> and
>>> > > > > repeatedly calls poll() until the response arrives.  If we were
>>> to
>>> > take
>>> > > > > the connection temporarily offline that method currently has no
>>> > > mechanism
>>> > > > > for checking for such a state before it sends the request; we
>>> could
>>> > add
>>> > > > it,
>>> > > > > but we would have to put in some kind of sleep loop to keep
>>> checking
>>> > > for
>>> > > > it
>>> > > > > to come back "online" before it could send.  Adding such a sleep
>>> loop
>>> > > > could
>>> > > > > be done, of course, but it doesn't sound ideal.
>>> > > > >
>>> > > > > A similar sleep loop situation exists in the async use case.  The
>>> > owner
>>> > > > > repeatedly calls poll() in a loop, but if the connection to the
>>> node
>>> > is
>>> > > > > temporarily offline then the poll() method would have to enter a
>>> > sleep
>>> > > > > loop until either the connection comes back online or the timeout
>>> > > elapses
>>> > > > > (whichever comes first).
>>> > > > >
>>> > > > > I don't know if there is an aversion to adding sleep loops like
>>> that,
>>> > > so
>>> > > > > maybe it isn't a big issue, but I wanted to raise it as a
>>> potential
>>> > > > concern
>>> > > > > with this approach.
>>> > > > >
>>> > > > > Also, is the import of SASL-specific classes in KafkaChannel a
>>> major
>>> > > > > objection to the current implementation?  I could eliminate that
>>> by
>>> > > > > replacing the 2 offending methods in KafkaChannel with this one
>>> and
>>> > > > > having the implementation delegate to the authenticator:
>>> > > > >
>>> > > > >     /**
>>> > > > >      * Respond to a re-authentication request. This occurs on the
>>> > > > >      * server side of the re-authentication dance (i.e. on the
>>> > > > >      * broker).
>>> > > > >      *
>>> > > > >      * @param requestHeader
>>> > > > >      *            the request header
>>> > > > >      * @param request
>>> > > > >      *            the request to process
>>> > > > >      * @return the response to return to the client
>>> > > > >      */
>>> > > > >     public AbstractResponse respondToReauthenticationRequest(
>>> > > > >             RequestHeader requestHeader, AbstractRequest request)
>>> > > > >
>>> > > > > There is practically no work being done in the KafkaChannel
>>> instance
>>> > > > > anyway -- it does some sanity checking but otherwise delegates
>>> to the
>>> > > > > authenticator.  We could just add a method to the Authenticator
>>> > > interface
>>> > > > > and delegate the whole thing.
>>> > > > >
>>> > > > > Ron
>>> > > > >
>>> > > > > On Wed, Sep 5, 2018 at 2:07 PM Ron Dagostino <rndg...@gmail.com>
>>> > > wrote:
>>> > > > >
>>> > > > >> Hi Rajini.  I'm now skeptical of my
>>> > > > >> "ConnectionState.REAUTHENTICATING" idea.
>>> > > > >> The concept of a connection being "READY" or not can impact
>>> > > > >> ConsumerCoordinator (see, for example,
>>> > > > >> https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java#L352).
>>> > > > >> The semantics of a connection being "READY" and the
>>> > > > >> implications/assumptions are not clear, and I suspect there
>>> will be
>>> > > some
>>> > > > >> unintended consequences of this approach that may not be
>>> immediately
>>> > > > >> apparent.
>>> > > > >>
>>> > > > >> I will confess that when I was implementing re-authentication I
>>> > > thought
>>> > > > >> it might be worthwhile to unify the authentication-related code
>>> > bases
>>> > > --
>>> > > > >> except I suspected it would be good to go in the other
>>> direction:
>>> > have
>>> > > > the
>>> > > > >> current code that directly uses the TransportLayer instead use
>>> > > > >> AuthenticationSuccessOrFailureReceiver and
>>> > > > AuthenticationRequestEnqueuer.
>>> > > > >> I'm not advocating that we do it -- I decided to not go there
>>> when
>>> > > > creating
>>> > > > >> the PR, after all -- but I did get a strong feeling that
>>> directly
>>> > > using
>>> > > > the
>>> > > > >> TransportLayer as is currently done is really only viable before
>>> > > anybody
>>> > > > >> else starts using the connection.  If we want to use the
>>> > > TransportLayer
>>> > > > again
>>> > > > >> after that point then it is up to us to somehow take the
>>> connection
>>> > > > >> "temporarily offline" so that we have exclusive rights to it
>>> again,
>>> > > and
>>> > > > I
>>> > > > >> wonder if the concept of a connection being "temporarily
>>> offline" is
>>> > > > >> something the existing code is able to handle -- probably not,
>>> and I
>>> > > > >> suspect there are unstated assumptions that would be
>>> invalidated.
>>> > > > >>
>>> > > > >> Do you think this particular "ConnectionState.REAUTHENTICATING"
>>> > idea
>>> > > is
>>> > > > >> worth pursuing?  How about the general idea of trying to use the
>>> > > > >> TransportLayer directly -- are you still feeling like it is
>>> viable?
>>> > > > >>
>>> > > > >> Ron
>>> > > > >>
>>> > > > >>
>>> > > > >>
>>> > > > >> On Wed, Sep 5, 2018 at 11:40 AM Ron Dagostino <
>>> rndg...@gmail.com>
>>> > > > wrote:
>>> > > > >>
>>> > > > >>> <<<in favor of implementing server-side kill in addition to
>>> > > > >>> re-authentication, not as a replacement.
>>> > > > >>> <<<I think Rajini suggested the same thing
>>> > > > >>> Oh, ok, I misunderstood.  Then I think we are on the same
>>> page: if
>>> > we
>>> > > > >>> are going to do re-authentication, then we should also do
>>> > > > >>> server-side-kill-upon-expiration as part of the same
>>> > implementation.
>>> > > > I'm
>>> > > > >>> good with that.
>>> > > > >>>
>>> > > > >>> I am also looking into Rajini's idea of doing
>>> re-authentication at
>>> > > the
>>> > > > >>> NetworkClient level and reusing the existing authentication
>>> code
>>> > > > path.  I
>>> > > > >>> was skeptical when she suggested it, but now that I look
>>> closer I
>>> > see
>>> > > > >>> something that I can try.  NetworkClient has logic to
>>> recognize the
>>> > > > >>> state "ConnectionState.CONNECTING" as meaning "you can't do
>>> > anything
>>> > > > >>> with this connection at the moment; please wait."  I'm going
>>> to try
>>> > > > adding
>>> > > > >>> a new state "ConnectionState.REAUTHENTICATING" that would be
>>> > > > recognized
>>> > > > >>> in a similar way.  Then the challenge becomes inserting myself
>>> into
>>> > > any
>>> > > > >>> existing flow that might be going on.  I'll probably add the
>>> > request
>>> > > > to set
>>> > > > >>> the state to "REAUTHENTICATING" to a queue if I can't grab the
>>> > state
>>> > > > >>> immediately and have the network client's poll() method check
>>> at
>>> > the
>>> > > > >>> end to see if any such requests can be granted; there would be
>>> a
>>> > > > callback
>>> > > > >>> associated with the request, and that way I can be assured I
>>> would
>>> > be
>>> > > > >>> granted the request in a reasonable amount of time (assuming
>>> the
>>> > > > connection
>>> > > > >>> doesn't close in the meantime).  Then it would be up to the
>>> > > > >>> callback
>>> > > > >>> implementation to perform the re-authentication dance and set
>>> the
>>> > > state
>>> > > > >>> back to "ConnectionState.READY".  I don't know if this will
>>> work,
>>> > and
>>> > > > >>> I'm probably missing some subtleties at the moment, but I'll
>>> give
>>> > it
>>> > > a
>>> > > > shot
>>> > > > >>> and see what happens.
>>> > > > >>>
>>> > > > >>> Ron
>>> > > > >>>
>>> > > > >>> On Wed, Sep 5, 2018 at 11:23 AM Colin McCabe <
>>> cmcc...@apache.org>
>>> > > > wrote:
>>> > > > >>>
>>> > > > >>>> On Wed, Sep 5, 2018, at 07:34, Ron Dagostino wrote:
>>> > > > >>>> > I added a "How To Support Re-Authentication for Other SASL
>>> > > > Mechanisms"
>>> > > > >>>> > section to the KIP as Rajini suggested.  I also added a
>>> > "Rejected
>>> > > > >>>> > Alternative" for the idea of forcibly closing connections
>>> on the
>>> > > > >>>> client
>>> > > > >>>> > side upon token refresh or on the server side upon token
>>> > > expiration.
>>> > > > >>>> It
>>> > > > >>>> > may be a bit premature to reject the server-side kill
>>> scenario
>>> > > given
>>> > > > >>>> that
>>> > > > >>>> > Colin and Rajini are partial to it, but see below for what I
>>> > said
>>> > > > >>>> about it,
>>> > > > >>>> > and I think it makes sense -- server-side kill without an
>>> > ability
>>> > > > for
>>> > > > >>>> the
>>> > > > >>>> > client to re-authenticate to avoid it may be useful in
>>> certain
>>> > > > >>>> specific
>>> > > > >>>> > cases, but as a general feature it doesn't really work.  I
>>> would
>>> > > be
>>> > > > >>>> willing
>>> > > > >>>> > to add server-side-kill to the scope of this KIP if that is
>>> > > desired.
>>> > > > >>>>
>>> > > > >>>> Hi Ron,
>>> > > > >>>>
>>> > > > >>>> To clarify, I am in favor of implementing server-side kill in
>>> > > addition
>>> > > > >>>> to re-authentication, not as a replacement.  I think Rajini
>>> > > suggested
>>> > > > the
>>> > > > >>>> same thing.
>>> > > > >>>>
>>> > > > >>>> It seems clear that server-side kill is needed to provide
>>> > security.
>>> > > > >>>> Otherwise a bad client can simply decide not to
>>> re-authenticate,
>>> > and
>>> > > > >>>> continue using server resources indefinitely.  Neither
>>> > > authentication
>>> > > > nor
>>> > > > >>>> re-authentication should be optional, or else the bad guys
>>> will
>>> > > > simply take
>>> > > > >>>> the option not to authenticate.
>>> > > > >>>>
>>> > > > >>>> best,
>>> > > > >>>> Colin
>>> > > > >>>>
>>> > > > >>>>
>>> > > > >>>> >
>>> > > > >>>> > A brute-force alternative is to simply kill the connection
>>> on
>>> > the
>>> > > > >>>> client
>>> > > > >>>> > > side when the background login thread refreshes the
>>> > credential.
>>> > > > The
>>> > > > >>>> > > advantage is that we don't need a code path for
>>> > > re-authentication
>>> > > > –
>>> > > > >>>> the
>>> > > > >>>> > > client simply connects again to replace the connection
>>> that
>>> > was
>>> > > > >>>> killed.
>>> > > > >>>> > > There are many disadvantages, though.  The approach is
>>> harsh –
>>> > > > >>>> having
>>> > > > >>>> > > connections pulled out from underneath the client will
>>> > introduce
>>> > > > >>>> latency
>>> > > > >>>> > > while the client reconnects; it introduces non-trivial
>>> > resource
>>> > > > >>>> utilization
>>> > > > >>>> > > on both the client and server as TLS is renegotiated; and
>>> it
>>> > > > forces
>>> > > > >>>> the
>>> > > > >>>> > > client to periodically "recover" from what essentially
>>> looks
>>> > > like
>>> > > > a
>>> > > > >>>> failure
>>> > > > >>>> > > scenario.  While these are significant disadvantages, the
>>> most
>>> > > > >>>> significant
>>> > > > >>>> > > disadvantage of all is that killing connections on the
>>> client
>>> > > side
>>> > > > >>>> adds no
>>> > > > >>>> > > security – trusting the client to kill its connection in a
>>> > > timely
>>> > > > >>>> fashion
>>> > > > >>>> > > is a blind and unjustifiable trust.
>>> > > > >>>> > >
>>> > > > >>>> >
>>> > > > >>>> >
>>> > > > >>>> > > We could kill the connection from the server side instead,
>>> > when
>>> > > > the
>>> > > > >>>> token
>>> > > > >>>> > > expires.  But in this case, if there is no ability for the
>>> > > client
>>> > > > to
>>> > > > >>>> > > re-authenticate to avoid the killing of the connection in
>>> the
>>> > > > first
>>> > > > >>>> place,
>>> > > > >>>> > > then we still have all of the harsh approach disadvantages
>>> > > > >>>> mentioned above.
>>> > > > >>>> >
>>> > > > >>>> >
>>> > > > >>>> > Ron
>>> > > > >>>> >
>>> > > > >>>> > On Wed, Sep 5, 2018 at 10:25 AM Colin McCabe <
>>> > cmcc...@apache.org>
>>> > > > >>>> wrote:
>>> > > > >>>> >
>>> > > > >>>> > > On Wed, Sep 5, 2018, at 01:41, Rajini Sivaram wrote:
>>> > > > >>>> > > > *Re-authentication vs disconnection:*
>>> > > > >>>> > > > In a vast number of secure Kafka deployments, SASL_SSL
>>> is
>>> > the
>>> > > > >>>> security
>>> > > > >>>> > > > protocol (this is the recommended config for
>>> OAUTHBEARER).
>>> > If
>>> > > we
>>> > > > >>>> require
>>> > > > >>>> > > > disconnections on token expiry, we would need new
>>> > connections
>>> > > to
>>> > > > >>>> be
>>> > > > >>>> > > > established with an expensive SSL handshake. This adds
>>> load
>>> > on
>>> > > > >>>> the broker
>>> > > > >>>> > > > and will result in a latency spike. For OAUTHBEARER in
>>> > > > >>>> particular, when
>>> > > > >>>> > > > tokens are used to make authorisation decisions, we
>>> want to
>>> > > be
>>> > > > >>>> able to
>>> > > > >>>> > > > support short-lived tokens where token lifetime
>>> (granting
>>> > > > >>>> authorisation)
>>> > > > >>>> > > is
>>> > > > >>>> > > > small. To make this usable in practice, I believe we
>>> need to
>>> > > > >>>> support
>>> > > > >>>> > > > re-authentication of existing connections.
>>> > > > >>>> > >
>>> > > > >>>> > > Hi Rajini,
>>> > > > >>>> > >
>>> > > > >>>> > > Thanks for the explanation.  That makes sense.
>>> > > > >>>> > >
>>> > > > >>>> > > >
>>> > > > >>>> > > > *Also explicitly out-of-scope for this proposal is the
>>> > ability
>>> > > > for
>>> > > > >>>> > > brokers
>>> > > > >>>> > > > to close connections that continue to use expired
>>> > credentials.
>>> > > > >>>> This
>>> > > > >>>> > > > ability is a natural next step, but it will be addressed
>>> > via a
>>> > > > >>>> separate
>>> > > > >>>> > > KIP
>>> > > > >>>> > > > if/when this one is adopted.*
>>> > > > >>>> > > > Perhaps we could do this the other way round? I don't
>>> think
>>> > we
>>> > > > >>>> would ever
>>> > > > >>>> > > > want to close connections on the client-side to support
>>> > > expired
>>> > > > >>>> > > credentials
>>> > > > >>>> > > > because that doesn't add any security guarantees. But
>>> we do
>>> > > > >>>> require the
>>> > > > >>>> > > > ability for brokers to close connections in order to
>>> enforce
>>> > > > >>>> credential
>>> > > > >>>> > > > expiry. Disconnection on the broker-side may be
>>> sufficient
>>> > for
>>> > > > >>>> some
>>> > > > >>>> > > > deployments and could be useful on its own. It would
>>> also be
>>> > > the
>>> > > > >>>> easier
>>> > > > >>>> > > > implementation. So maybe that could be the first step?
>>> > > > >>>> > >
>>> > > > >>>> > > +1 for doing disconnection first.  Otherwise, as you
>>> noted,
>>> > > there
>>> > > > >>>> are no
>>> > > > >>>> > > security guarantees -- the client can just decide not to
>>> > > > >>>> re-authenticate
>>> > > > >>>> > > and keep using the old credentials.  You don't even need
>>> to
>>> > > modify
>>> > > > >>>> the
>>> > > > >>>> > > source code -- older clients would behave this way.
>>> > > > >>>> > >
>>> > > > >>>> > > best,
>>> > > > >>>> > > Colin
>>> > > > >>>> > >
>>> > > > >>>> > > >
>>> > > > >>>> > > > *The implementation is designed in such a way that it
>>> does
>>> > not
>>> > > > >>>> preclude
>>> > > > >>>> > > > adding support for re-authentication of other SASL
>>> mechanisms
>>> > > > >>>> (e.g. PLAIN,
>>> > > > >>>> > > > SCRAM-related, and GSSAPI), but doing so is explicitly
>>> > > > >>>> out-of-scope for
>>> > > > >>>> > > > this proposal. *
>>> > > > >>>> > > > Isn't re-authentication driven by ExpiringCredential? We
>>> > don't
>>> > > > >>>> need to
>>> > > > >>>> > > > support re-authentication by default for the other
>>> > mechanisms
>>> > > in
>>> > > > >>>> this
>>> > > > >>>> > > KIP,
>>> > > > >>>> > > > but any mechanism could enable this by adding a custom
>>> login
>>> > > > >>>> callback
>>> > > > >>>> > > > handler to provide an ExpiringCredential? For
>>> disconnection
>>> > as
>>> > > > >>>> well as
>>> > > > >>>> > > > re-authentication, it will be good if we can specify
>>> exactly
>>> > > how
>>> > > > >>>> it can
>>> > > > >>>> > > be
>>> > > > >>>> > > > implemented for each of the SASL mechanisms, even if we
>>> > > actually
>>> > > > >>>> > > implement
>>> > > > >>>> > > > it only for OAUTHBEARER.
>>> > > > >>>> > > >
>>> > > > >>>> > > >
>>> > > > >>>> > > > On Wed, Sep 5, 2018 at 2:43 AM, Colin McCabe <
>>> > > > cmcc...@apache.org>
>>> > > > >>>> wrote:
>>> > > > >>>> > > >
>>> > > > >>>> > > > > On Tue, Sep 4, 2018, at 17:43, Ron Dagostino wrote:
>>> > > > >>>> > > > > > Hi Colin.  Different organizations will rely on
>>> > different
>>> > > > >>>> token
>>> > > > >>>> > > > > lifetimes,
>>> > > > >>>> > > > > > but anything shorter than an hour feels like it
>>> would be
>>> > > > >>>> pretty
>>> > > > >>>> > > > > > aggressive.  An hour or more will probably be most
>>> > common.
>>> > > > >>>> > > > >
>>> > > > >>>> > > > > Thanks.  That's helpful to give me a sense of what the
>>> > > > >>>> performance
>>> > > > >>>> > > impact
>>> > > > >>>> > > > > might be.
>>> > > > >>>> > > > >
>>> > > > >>>> > > > > >
>>> > > > >>>> > > > > > <<<alternate solution of terminating connections
>>> when
>>> > the
>>> > > > >>>> bearer
>>> > > > >>>> > > token
>>> > > > >>>> > > > > > changed
>>> > > > >>>> > > > > > I may be mistaken, but I think you are suggesting
>>> here
>>> > > that
>>> > > > we
>>> > > > >>>> > > forcibly
>>> > > > >>>> > > > > > kill connections from the client side whenever the
>>> > > > background
>>> > > > >>>> Login
>>> > > > >>>> > > > > refresh
>>> > > > >>>> > > > > > thread refreshes the token (which it currently does
>>> so
>>> > > that
>>> > > > >>>> the
>>> > > > >>>> > > client
>>> > > > >>>> > > > > can
>>> > > > >>>> > > > > > continue to make new connections).  Am I correct?
>>> > > > >>>> > > > >
>>> > > > >>>> > > > > Yes, this is what I'm thinking about.  We could also
>>> > > terminate
>>> > > > >>>> the
>>> > > > >>>> > > > > connection on the server, if that is more convenient.
>>> > > > >>>> > > > >
>>> > > > >>>> > > > > >  If that is what you are
>>> > > > >>>> > > > > > referring to, my sense is that it would be a very
>>> crude
>>> > > way
>>> > > > of
>>> > > > >>>> > > dealing
>>> > > > >>>> > > > > with
>>> > > > >>>> > > > > > the issue that would probably lead to
>>> dissatisfaction in
>>> > > > some
>>> > > > >>>> sense
>>> > > > >>>> > > > > (though
>>> > > > >>>> > > > > > I can't be sure).
>>> > > > >>>> > > > >
>>> > > > >>>> > > > > What information should we gather so that we can be
>>> sure?
>>> > > > >>>> > > > >
>>> > > > >>>> > > > > >  I do know that when I implemented SASL/OAUTHBEARER
>>> it
>>> > > > >>>> > > > > > was communicated that leaving existing connections
>>> > intact
>>> > > --
>>> > > > >>>> as is
>>> > > > >>>> > > done
>>> > > > >>>> > > > > for
>>> > > > >>>> > > > > > GSSAPI -- was the appropriate path forward.
>>> > > > >>>> > > > >
>>> > > > >>>> > > > > Thanks, that's good background information.  Can
>>> someone
>>> > > chime
>>> > > > >>>> in with
>>> > > > >>>> > > the
>>> > > > >>>> > > > > reasoning behind this?
>>> > > > >>>> > > > >
>>> > > > >>>> > > > > My best guess is that terminating connections might
>>> cause
>>> > a
>>> > > > >>>> temporary
>>> > > > >>>> > > > > increase in latency as they are re-established.
>>> > > > >>>> > > > >
>>> > > > >>>> > > > > In any case, we should figure out what the reasoning
>>> is so
>>> > > > that
>>> > > > >>>> we can
>>> > > > >>>> > > > > make a decision.  It seems worthwhile including this
>>> as a
>>> > > > >>>> "rejected
>>> > > > >>>> > > > > alternative," at least.
>>> > > > >>>> > > > >
>>> > > > >>>> > > > > thanks,
>>> > > > >>>> > > > > Colin
>>> > > > >>>> > > > >
>>> > > > >>>> > > > >
>>> > > > >>>> > > > > >
>>> > > > >>>> > > > > > Ron
>>> > > > >>>> > > > > >
>>> > > > >>>> > > > > > On Tue, Sep 4, 2018 at 8:31 PM Colin McCabe <
>>> > > > >>>> cmcc...@apache.org>
>>> > > > >>>> > > wrote:
>>> > > > >>>> > > > > >
>>> > > > >>>> > > > > > > Hi Ron,
>>> > > > >>>> > > > > > >
>>> > > > >>>> > > > > > > Thanks for the KIP.
>>> > > > >>>> > > > > > >
>>> > > > >>>> > > > > > > What is the frequency at which you envision bearer
>>> > > > >>>> > > > > > > tokens changing?
>>> > > > >>>> > > > > > >
>>> > > > >>>> > > > > > > Did you consider the alternate solution of
>>> terminating
>>> > > > >>>> connections
>>> > > > >>>> > > when
>>> > > > >>>> > > > > > > the bearer token changed?
>>> > > > >>>> > > > > > >
>>> > > > >>>> > > > > > > best,
>>> > > > >>>> > > > > > > Colin
>>> > > > >>>> > > > > > >
>>> > > > >>>> > > > > > >
>>> > > > >>>> > > > > > > On Tue, Aug 28, 2018, at 07:28, Ron Dagostino
>>> wrote:
>>> > > > >>>> > > > > > > > Hi everyone. I created KIP 368: Allow SASL
>>> > Connections
>>> > > > to
>>> > > > >>>> > > > > Periodically
>>> > > > >>>> > > > > > > > Re-Authenticate
>>> > > > >>>> > > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-368%3A+Allow+SASL+Connections+to+Periodically+Re-Authenticate>
>>> > > > >>>> > > > > > > > (https://cwiki.apache.org/confluence/display/KAFKA/KIP-368%3A+Allow+SASL+Connections+to+Periodically+Re-Authenticate).
>>> > > > >>>> > > > > > > > The motivation for this KIP is as follows:
>>> > > > >>>> > > > > > > >
>>> > > > >>>> > > > > > > > The adoption of KIP-255: OAuth Authentication
>>> via
>>> > > > >>>> > > SASL/OAUTHBEARER
>>> > > > >>>> > > > > > > > <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75968876>
>>> > > > >>>> > > > > > >
>>> > > > >>>> > > > > > > > in
>>> > > > >>>> > > > > > > > release 2.0.0 creates the possibility of using
>>> > > > >>>> information in the
>>> > > > >>>> > > > > bearer
>>> > > > >>>> > > > > > > > token to make authorization decisions.
>>> > Unfortunately,
>>> > > > >>>> however,
>>> > > > >>>> > > Kafka
>>> > > > >>>> > > > > > > > connections are long-lived, so there is no
>>> ability
>>> > to
>>> > > > >>>> change the
>>> > > > >>>> > > > > bearer
>>> > > > >>>> > > > > > > > token associated with a particular connection.
>>> > > Allowing
>>> > > > >>>> SASL
>>> > > > >>>> > > > > > > > connections
>>> > > > >>>> > > > > > > > to periodically re-authenticate would resolve
>>> this.
>>> > > In
>>> > > > >>>> addition
>>> > > > >>>> > > to
>>> > > > >>>> > > > > this
>>> > > > >>>> > > > > > > > motivation there are two others that are
>>> > > > security-related.
>>> > > > >>>> > > First, to
>>> > > > >>>> > > > > > > > eliminate access to Kafka for connected
>>> clients, the
>>> > > > >>>> current
>>> > > > >>>> > > > > requirement
>>> > > > >>>> > > > > > > > is
>>> > > > >>>> > > > > > > > to remove all authorizations (i.e. remove all
>>> ACLs).
>>> > > > >>>> This is
>>> > > > >>>> > > > > necessary
>>> > > > >>>> > > > > > > > because of the long-lived nature of the
>>> connections.
>>> > > It
>>> > > > >>>> is
>>> > > > >>>> > > > > > > > operationally
>>> > > > >>>> > > > > > > > simpler to shut off access at the point of
>>> > > > >>>> authentication, and
>>> > > > >>>> > > with
>>> > > > >>>> > > > > the
>>> > > > >>>> > > > > > > > release of KIP-86: Configurable SASL Callback
>>> > Handlers
>>> > > > >>>> > > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-86%3A+Configurable+SASL+callback+handlers>
>>> > > > >>>> > > > > > > > it
>>> > > > >>>> > > > > > > > is going to become more and more likely that
>>> > > > >>>> installations will
>>> > > > >>>> > > > > > > > authenticate users against external directories
>>> > (e.g.
>>> > > > via
>>> > > > >>>> > > LDAP).  The
>>> > > > >>>> > > > > > > > ability to stop Kafka access by simply
>>> disabling an
>>> > > > >>>> account in an
>>> > > > >>>> > > > > LDAP
>>> > > > >>>> > > > > > > > directory (for example) is desirable.  The
>>> second
>>> > > > >>>> motivating
>>> > > > >>>> > > factor
>>> > > > >>>> > > > > for
>>> > > > >>>> > > > > > > > re-authentication related to security is that
>>> the
>>> > use
>>> > > of
>>> > > > >>>> > > short-lived
>>> > > > >>>> > > > > > > > tokens
>>> > > > >>>> > > > > > > > is a common OAuth security recommendation, but
>>> > > issuing a
>>> > > > >>>> > > short-lived
>>> > > > >>>> > > > > > > > token
>>> > > > >>>> > > > > > > > to a Kafka client (or a broker when OAUTHBEARER
>>> is
>>> > the
>>> > > > >>>> > > inter-broker
>>> > > > >>>> > > > > > > > protocol) currently has no benefit because once
>>> a
>>> > > client
>>> > > > >>>> is
>>> > > > >>>> > > > > connected to
>>> > > > >>>> > > > > > > > a
>>> > > > >>>> > > > > > > > broker the client is never challenged again and
>>> the
>>> > > > >>>> connection
>>> > > > >>>> > > may
>>> > > > >>>> > > > > > > > remain
>>> > > > >>>> > > > > > > > intact beyond the token expiration time (and may
>>> > > remain
>>> > > > >>>> intact
>>> > > > >>>> > > > > > > > indefinitely
>>> > > > >>>> > > > > > > > under perfect circumstances).  This KIP proposes
>>> > > adding
>>> > > > >>>> the
>>> > > > >>>> > > ability
>>> > > > >>>> > > > > for
>>> > > > >>>> > > > > > > > clients (and brokers when OAUTHBEARER is the
>>> > > > inter-broker
>>> > > > >>>> > > protocol)
>>> > > > >>>> > > > > to
>>> > > > >>>> > > > > > > > re-authenticate their connections to brokers and
>>> > have
>>> > > > the
>>> > > > >>>> new
>>> > > > >>>> > > bearer
>>> > > > >>>> > > > > > > > token
>>> > > > >>>> > > > > > > > appear on their session rather than the old one.
>>> > > > >>>> > > > > > > >
>>> > > > >>>> > > > > > > > The description of this KIP is actually quite
>>> > > > >>>> straightforward
>>> > > > >>>> > > from a
>>> > > > >>>> > > > > > > > functionality perspective; from an
>>> implementation
>>> > > > >>>> perspective,
>>> > > > >>>> > > > > though,
>>> > > > >>>> > > > > > > the
>>> > > > >>>> > > > > > > > KIP is not so straightforward, so it includes a
>>> pull
>>> > > > >>>> request
>>> > > > >>>> > > with a
>>> > > > >>>> > > > > > > > proposed implementation.  See
>>> > > > >>>> > > > > > > > https://github.com/apache/kafka/pull/5582.
>>> > > > >>>> > > > > > > >
>>> > > > >>>> > > > > > > > Ron
>>> > > > >>>> > > > > > >
>>> > > > >>>> > > > >
>>> > > > >>>> > >
>>> > > > >>>>
>>> > > > >>>
>>> > > >
>>> > >
>>> >
>>>
>>
