Thanks for the KIP Chris!

This proposal will fill a gap in Kafka Connect deployments and will
strengthen their production readiness in even more diverse environments, in
which current workarounds are less practical.

I've read the discussions regarding why the rebalancing protocol is used
here and your intention to follow the approach which was recently used in
order to elegantly support upgrades without requiring rolling restarts
makes sense.

However, I'd like to revisit the proposed extension of the connect protocol
going a little bit more in detail.
Indeed, broadcasting the session key to the followers is necessary. But
adding it to the configs topic with a new key is compatible with previous
versions (the key will be ignored by workers that don't support this
protocol) and works since all workers will need to read this topic to the
end to remain in the group.

Additionally, to your point about consensus, meaning how the workers all
know during the same generation that the "sessioned" version of the
protocol was chosen by the leader, it seems to me that an explicit upgrade
of the protocol on the wire, on its name and on the metadata that are sent
with the join request by each worker might not be required. What I believe
would suffice is merely an increment of the version of the protocol,
because all the other information remains the same (for example, the list
of assigned and revoked tasks, etc). This would allow us not to extend the
metadata even more with yet another JoinGroupRequest and could achieve what
you want in terms of an orchestrated upgrade.

Without having exhaustively checked the various code paths, I feel that
such an approach would be less heavy-handed in terms of extending the
format of the connect protocol and would achieve a certain separation of
concerns between the information that is required for rebalancing and is
included in the protocol metadata and the information that is required for
securing the internal Connect REST endpoint. In theory, this method could
be used to even secure the eager version of the protocol, but I'd agree
that this out of the scope of the current proposal.

All the leader worker would have to check is whether all the assignments
that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if
you'd choose to go this route).

Overall, this does not differ a lot from your current proposal, which I
think is already in the right direction.
Let me know what you think.

Cheers,
Konstantine


On Tue, Aug 27, 2019 at 4:19 PM Magesh Nandakumar <mage...@confluent.io>
wrote:

> Hi Chris,
>
> Thanks a lot for the KIP. This will certainly be a useful feature. I would
> have preferred to use the topic approach as well but I also understand your
> point of view about the operational complexity for upgrades. If not with
> this KIP, I would certainly want to go that route at some point in the
> future.
>
> As far as using the rebalance protocol goes, it would be great if you could
> elaborate on what exactly would be the rebalance impact when a key expires.
> I see that you have called out saying that there should be no significant
> impact but it will be great to explicitly state as what is to be expected.
> I would prefer to not have any sorts of rebalancing when this happens since
> the connector and task assignments should not change with this event. It
> will be useful to explain this a bit more.
>
> Thanks,
> Magesh
>
> On Wed, Aug 21, 2019 at 4:40 PM Chris Egerton <chr...@confluent.io> wrote:
>
> > Hi all,
> >
> > I've made some tweaks to the KIP that I believe are improvements. More
> > detail can be found on the KIP page itself, but as a brief summary, the
> > three changes are:
> >
> > - The removal of the internal.request.verification property in favor of
> > modifying the default value for the connect.protocol property from
> > "compatible" to "sessioned"
> > - The renaming of some configurations to use better terminology (mostly
> > just "request" instead of "key" where appropriate)
> > - The addition of two new configurations that dictate how session keys
> are
> > to be generated
> >
> > Thanks Ryanne for the feedback so far, hope to hear from some more of you
> > soon :)
> >
> > Cheers,
> >
> > Chris
> >
> > On Mon, Aug 19, 2019 at 12:02 PM Chris Egerton <chr...@confluent.io>
> > wrote:
> >
> > > Hi Ryanne,
> > >
> > > The reasoning for this is provided in the KIP: "There would be no clear
> > > way to achieve consensus amongst the workers in a cluster on whether to
> > > switch to this new behavior." To elaborate on this--it would be bad if
> a
> > > follower worker began writing task configurations to the topic while
> the
> > > leader continued to only listen on the internal REST endpoint. It's
> > > necessary to ensure that every worker in a cluster supports this new
> > > behavior before switching it on.
> > >
> > > To be fair, it is technically possible to achieve consensus without
> using
> > > the group coordination protocol by adding new configurations and using
> a
> > > multi-phase rolling upgrade, but the operational complexity of this
> > > approach would significantly complicate things for the default case of
> > just
> > > wanting to bump your Connect cluster to the next version without having
> > to
> > > alter your existing worker config files and with only a single restart
> > per
> > > worker. The proposed approach provides a much simpler user experience
> for
> > > the most common use case and doesn't impose much additional complexity
> > for
> > > other use cases.
> > >
> > > I've updated the KIP to expand on points from your last emails; let me
> > > know what you think.
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Thu, Aug 15, 2019 at 4:53 PM Ryanne Dolan <ryannedo...@gmail.com>
> > > wrote:
> > >
> > >> Thanks Chris, that makes sense.
> > >>
> > >> I know you have already considered this, but I'm not convinced we
> should
> > >> rule out using Kafka topics for this purpose. That would enable the
> same
> > >> level of security without introducing any new authentication or
> > >> authorization mechanisms (your keys). And as you say, you'd need to
> lock
> > >> down Connect's topics and groups anyway.
> > >>
> > >> Can you explain what you mean when you say using Kafka topics would
> > >> require
> > >> "reworking the group coordination protocol"? I don't see how these are
> > >> related. Why would it matter if the workers sent Kafka messages vs
> POST
> > >> requests among themselves?
> > >>
> > >> Ryanne
> > >>
> > >> On Thu, Aug 15, 2019 at 3:57 PM Chris Egerton <chr...@confluent.io>
> > >> wrote:
> > >>
> > >> > Hi Ryanne,
> > >> >
> > >> > Yes, if the Connect group is left unsecured then that is a potential
> > >> > vulnerability. However, in a truly secure Connect cluster, the group
> > >> would
> > >> > need to be secured anyways to prevent attackers from joining the
> group
> > >> with
> > >> > the intent to either snoop on connector/task configurations or bring
> > the
> > >> > cluster to a halt by spamming the group with membership requests and
> > >> then
> > >> > not running the assigned connectors/tasks. Additionally, for a
> Connect
> > >> > cluster to be secure, access to internal topics (for configs,
> offsets,
> > >> and
> > >> > statuses) would also need to be restricted so that attackers could
> > not,
> > >> > e.g., write arbitrary connector/task configurations to the configs
> > >> topic.
> > >> > This is all currently possible in Kafka with the use of ACLs.
> > >> >
> > >> > I think the bottom line here is that there's a number of steps that
> > >> need to
> > >> > be taken to effectively lock down a Connect cluster; the point of
> this
> > >> KIP
> > >> > is to close a loophole that exists even after all of those steps are
> > >> taken,
> > >> > not to completely secure this one vulnerability even when no other
> > >> security
> > >> > measures are taken.
> > >> >
> > >> > Cheers,
> > >> >
> > >> > Chris
> > >> >
> > >> > On Wed, Aug 14, 2019 at 10:56 PM Ryanne Dolan <
> ryannedo...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Chris, I don't understand how the rebalance protocol can be used
> to
> > >> give
> > >> > > out session tokens in a secure way. It seems that any attacker
> could
> > >> just
> > >> > > join the group and sign requests with the provided token. Am I
> > missing
> > >> > > something?
> > >> > >
> > >> > > Ryanne
> > >> > >
> > >> > > On Wed, Aug 14, 2019, 2:31 PM Chris Egerton <chr...@confluent.io>
> > >> wrote:
> > >> > >
> > >> > > > The KIP page can be found at
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints
> > >> > > > ,
> > >> > > > by the way. Apologies for neglecting to include it in my initial
> > >> email!
> > >> > > >
> > >> > > > On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton <
> > chr...@confluent.io
> > >> >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi all,
> > >> > > > >
> > >> > > > > I'd like to start discussion on a KIP to secure the internal
> > "POST
> > >> > > > > /connectors/<name>/tasks" endpoint for the Connect framework.
> > The
> > >> > > > proposed
> > >> > > > > changes address a vulnerability in the framework in its
> current
> > >> state
> > >> > > > that
> > >> > > > > allows malicious users to write arbitrary task configurations
> > for
> > >> > > > > connectors; it is vital that this issue be addressed in order
> > for
> > >> any
> > >> > > > > Connect cluster to be secure.
> > >> > > > >
> > >> > > > > Looking forward to your thoughts,
> > >> > > > >
> > >> > > > > Chris
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Reply via email to