Thanks a lot for the guidance; I think we'll go ahead with one cluster. I
just need to figure out how our CD pipeline can talk to our Connect cluster
securely, since it will need direct access to make REST API calls.
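
To make that concrete, this is the kind of call the pipeline would be
issuing (host and connector name below are just placeholders):

    curl -X PUT http://connect.internal:8083/connectors/orders-sink/config \
         -H "Content-Type: application/json" \
         -d @orders-sink-config.json

so it really just needs to reach the workers' REST port (8083 by default).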

Lastly, a question or maybe a piece of feedback… is it not possible to
specify the key serializer and deserializer as part of the REST API job
config?
The issue is that sometimes our data is Avro and sometimes it's JSON, and it
seems I'd need two separate clusters to handle that?

On 6 January 2017 at 1:54:10 pm, Ewen Cheslack-Postava (e...@confluent.io)
wrote:

On Thu, Jan 5, 2017 at 3:12 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:

> Hi,
>
> We like to operate in micro-services (dockerize and ship everything on ECS)
> and I was wondering which approach was preferred.
> We have one Kafka cluster, one ZooKeeper cluster, etc., but when it comes to
> Kafka Connect I have some doubts.
>
> Is it better to have one big Kafka Connect cluster with multiple nodes, or
> many small Kafka Connect clusters or standalone instances, one for each
> connector / ETL job?
>

You can do any of these, and it may depend on how you do
orchestration/deployment.

We built Connect to support running one big cluster that hosts a bunch of
connectors. It balances work automatically and lets you control scale
up/down via increased parallelism. This means we don't need to make any
assumptions about how you deploy, how you handle elastic scaling of your
clusters, etc. But if you run in an environment that already has the tooling
in place to do that, you can also opt to run many smaller clusters and use
that tooling to scale up/down. In that case you'd just make sure each
connector has enough tasks so that when you scale up the number of workers
in a cluster, the rebalancing of work leaves every worker with something to
do.
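
For example, a connector submitted to the cluster might look like this
(name and class are placeholders; tasks.max is the knob that determines how
much parallelism there is to spread across workers):

    {
      "name": "orders-sink",
      "config": {
        "connector.class": "com.example.OrdersSinkConnector",
        "topics": "orders",
        "tasks.max": "8"
      }
    }

With tasks.max comfortably above the current worker count, adding workers
just causes the existing tasks to be redistributed across more machines.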

The main drawback of doing this is that Connect uses a few topics for
configs, status, and offsets, and these need to be unique per cluster. That
means 3 such topics per cluster, i.e. 3N topics instead of 3. If you're
running a *lot* of connectors, that could eventually become a problem. It
also means you have that many more worker configs to handle, clusters to
monitor, etc. And deploying a connector is no longer as simple as making a
call to the service's REST API, since there isn't a single centralized
service. The main benefits I can think of are a) you can reuse tooling you
already prefer for handling elasticity and b) better resource isolation
between connectors (i.e. an OOM error in one connector won't affect any
other connectors).
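
Concretely, each cluster's worker config needs its own group.id as well as
its own copies of those three topics (names here are just examples):

    # distributed worker config for one of the smaller clusters
    group.id=connect-etl-a
    config.storage.topic=connect-etl-a-configs
    offset.storage.topic=connect-etl-a-offsets
    status.storage.topic=connect-etl-a-status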

We'd generally recommend using standalone mode only when distributed mode
doesn't make sense, e.g. for log file collection. Otherwise, the fault
tolerance and high availability of distributed mode make it the better
choice.
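
For reference, the two modes are started with different scripts (file names
here are whatever you use for your configs); in standalone mode connector
configs are local files, while in distributed mode connectors are managed
via the REST API:

    # standalone: worker config plus one or more connector property files
    bin/connect-standalone.sh worker.properties my-connector.properties

    # distributed: worker config only
    bin/connect-distributed.sh worker.properties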

On your specific points:

>
> The issues I’m trying to address are:
> - Integration with our CI/CD pipeline
>

I'm not sure anything about Connect affects this. Is there a specific
concern you have about the CI/CD pipeline & Connect?


> - Efficient resource utilisation
>

Putting all the connectors into one cluster will probably result in better
resource utilization unless you're already automatically tracking usage and
scaling appropriately. The reason is that with a bunch of small clusters,
you're stuck trying to optimize resource usage for N clusters separately.
Since Connect can already (roughly) balance work, putting all the work into
one cluster and letting Connect split it up means you only need to watch the
utilization of the nodes in that one cluster and scale up or down as
appropriate.


> - Easily add new jar files that connectors depend on with minimal downtime
>

This one is a bit interesting. You shouldn't have any downtime from adding
jars, in the sense that you can do rolling bounces of Connect. The one
caveat is that the way it currently rebalances work involves halting all
connectors/tasks, doing the rebalance, and then starting them up again. We
plan to improve this, but the timeframe is still uncertain. Usually these
rebalance steps are pretty quick. The main reason this can be a concern is
that halting some connectors can take a while (e.g. because they need to
fully flush their data), which means the period during which your connectors
are not processing data during one of those rebalances is determined by the
"worst" connector.

I would recommend trying a single cluster but monitoring whether you see
stalls due to rebalances. If you do, then moving to multiple clusters might
make sense. (This also, obviously, depends a lot on your SLA for data
delivery.)
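
One easy way to spot-check this after a deploy or bounce is the status
endpoint (host and connector name are placeholders):

    curl http://connect.internal:8083/connectors/orders-sink/status

which shows the state of the connector and each of its tasks (e.g. RUNNING
or FAILED), so you can see whether everything came back after the rebalance.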


> - Monitoring operations
>

Multiple clusters definitely seem messier and more complicated for this. A
single cluster will have more workers in it, but it's one service you need
to monitor and maintain.

Hope that helps!

-Ewen


>
> Thanks for your guidance
>
> Regards,
> Stephane
>
