Re: [DISCUSS] KIP-618: Atomic commit of source connector records and offsets

Chris Egerton Thu, 13 May 2021 18:04:47 -0700

Hi Tom,

I'm fine with an implicit mapping of connector-provided null to
user-exposed UNKNOWN, if the design continues down that overall path.

Allowing users to assert that a connector should support exactly-once
sounds reasonable; it's similar to the pre-flight checks we already do for
connector configurations such as invoking "Connector::validate" and
ensuring that all of the referenced SMTs, Predicates, and Converter classes
are present on the worker. In fact, I wonder if that's how we could
implement it--as a preflight check. That way, Connector and Task instances
won't even have the chance to fail; if the user states a requirement for
exactly-once support but their connector configuration doesn't meet that
requirement, we can fail the connector creation/reconfiguration request
before even writing the new config to the config topic. We could also add
this support to the "PUT /{connectorType}/config/validate" endpoint so that
users could test exactly-once support for various configurations without
having to actually create or reconfigure a connector. We could still fail
tasks on startup if something slipped by (possibly due to connector
upgrade) but it'd make the UX a bit smoother in most cases to fail faster.

Since a possible use of the property is to allow future users to control
exactly-once support on a per-connector basis, I wonder whether a binary
property is sufficient here. Even if a connector doesn't support
exactly-once, there could still be benefits to using a transactional
producer with rounds of zombie fencing; for example, preventing duplicate
task instances from producing data, which could be leveraged to provide
at-most-once delivery guarantees. In that case, we'd want a way to signal
to Connect that the framework should do everything it does to provide
exactly-once source support, but not make the assertion on the connector
config, and we'd end up providing three possibilities to users: required,
best-effort, and disabled. It sounds like right now what we're proposing is
that we expose only the first two and don't allow users to actually disable
exactly-once support on a per-connector basis, but want to leave room for
the third option in the future. With that in mind, "required/not_required"
might not be the best fit. Perhaps "required"/"requested" for now, with
"disabled" as the value that could be implemented later?

RE: "Is the problem here simply that the zombie fencing provided by the
producer is only available when using transactions, and therefore having a
non-transactional producer in the cluster poses a risk of a zombie not
being fenced?"--that's half of it. The other half is we'd still need to
track the number of tasks for that connector that would need to be fenced
out if/when exactly-once for it were switched on. If we had the
intermediate producer you describe at our disposal, and it were in use by
every running source task for a given connector, we could probably enable
users to toggle exactly-once on a per-connector basis, but it would also
require new ACLs for all connectors. Even though we're allowed to make
breaking changes with the upcoming 3.0 release, I'm not sure the tradeoff
is worth it. I suppose we could break down exactly-once support into two
separate config properties--a worker-level property, that causes all source
tasks on the worker to use producers that can be fenced (either full-on
transactional producers or "intermediate" producers), and a per-connector
property, that toggles whether the connector itself uses a full-on
transactional producer or just an intermediate producer (and whether or not
zombie fencing is performed for new task configs). This seems like it might
be overkill for now, though.

As far as the zombie fencing endpoint goes--the behavior will be the same
either way w/r/t the exactly.once.source.enabled property. The property
will dictate whether the endpoint is used by tasks, but it'll be available
for use no matter what. This is how a rolling upgrade becomes possible;
even if the leader hasn't been upgraded yet (to set
exactly.once.source.enabled to true), it will still be capable of handling
fencing requests from workers that have already been upgraded.

Cheers,

Chris

On Wed, May 12, 2021 at 5:33 AM Tom Bentley <tbent...@redhat.com> wrote:

> Hi Chris and Randall,
>
> I can see that for connectors where exactly once is configuration-dependent
> it makes sense to use a default method. The problem with having an explicit
> UNKNOWN case is we really want connector developers to _not_ use it. That
> could mean it's deprecated from the start. Alternatively we could omit it
> from the enum and use null to mean unknown (we'd have to check for a null
> result anyway), with the contract for the method being that it should
> return non-null. Of course, this doesn't remove the ambiguous case, but
> avoids the need to eventually remove UNKNOWN in the future.
>
> I think there's another way for a worker to use the value too: Imagine
> you're deploying a connector that you need to be exactly once. It's awkward
> to have to query the REST API to determine that exactly once was working,
> especially if you need to do this after config changes too. What you
> actually want is to make an EOS assertion, via a connector config (e.g.
> require.exactly.once=true, or perhaps exactly.once=required/not_required),
> which would fail the connector/task if exactly once could not be provided.
>
> The not_required case wouldn't disable the transactional runtime
> environment, simply not guarantee that it was providing EOS. Although it
> would leave the door open to supporting mixed EOS/non-transactional
> deployments in the cluster in the future, if that became possible (i.e. we
> could retrospectively make not_required mean no transactions).
>
> On the subject of why it's not possible to enabled exactly once on a
> per-connector basis: Is the problem here simply that the zombie fencing
> provided by the producer is only available when using transactions, and
> therefore having a non-transactional producer in the cluster poses a risk
> of a zombie not being fenced? This makes me wonder whether there's a case
> for a producer with zombie fencing that is not transactional (intermediate
> between idempotent and transactional producer). IIUC this would need to
> make a InitProducerId request and use the PID in produce requests, but
> could dispense with the other transactional RPCs. If such a thing existed
> would the zombie fencing it provided be sufficient to provide safe
> semantics for running a non-EOS connector in an EOS-capable cluster?
>
> The endpoint for zombie fencing: It's not described how this works when
> exactly.once.source.enabled=false
>
> Cheers,
>
> Tom
>

Re: [DISCUSS] KIP-618: Atomic commit of source connector records and offsets

Reply via email to