Yeah, you'd set the key.converter and/or value.converter in your connector config.
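
For illustration, here is a minimal Python sketch of what that could look like when creating connectors through the REST API. The worker address, Schema Registry URL, connector names, file paths, and topics are assumptions made up for the example; the converter classes are the stock JSON converter and Confluent's Avro converter. It only builds the requests and does not send anything.

```python
import json
import urllib.request

# Assumed address of a Connect worker's REST API (8083 is the default port).
CONNECT_URL = "http://localhost:8083"
# Assumed Schema Registry address, only needed by the Avro converter.
SCHEMA_REGISTRY_URL = "http://localhost:8081"

# Per-connector converter overrides (supported since 0.10.1.0, see KIP-75).
JSON_CONVERTERS = {
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
}
AVRO_CONVERTERS = {
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": SCHEMA_REGISTRY_URL,
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": SCHEMA_REGISTRY_URL,
}


def connector_payload(name, base_config, converters):
    """Build the JSON body for POST /connectors, merging converter overrides."""
    config = dict(base_config)
    config.update(converters)
    return {"name": name, "config": config}


def create_request(payload):
    """Build (but do not send) the HTTP request that creates the connector."""
    return urllib.request.Request(
        CONNECT_URL + "/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical connector settings; only the converter keys matter here.
base = {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/example.txt",
    "topic": "example-topic",
}
json_payload = connector_payload("json-file-source", base, JSON_CONVERTERS)
avro_payload = connector_payload("avro-file-source", base, AVRO_CONVERTERS)
```

Passing the result of `create_request(json_payload)` to `urllib.request.urlopen` would submit it, so both a JSON and an Avro connector can live on the same cluster.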
-Ewen

On Thu, Jan 5, 2017 at 9:50 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:

> Thanks!
> So I just override the conf while doing the API call? It’d be great to
> have this documented somewhere on the confluent website. I couldn’t find
> it.
>
> On 6 January 2017 at 3:42:45 pm, Ewen Cheslack-Postava (e...@confluent.io)
> wrote:
>
> On Thu, Jan 5, 2017 at 7:19 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:
>
>> Thanks a lot for the guidance, I think we’ll go ahead with one cluster. I
>> just need to figure out how our CD pipeline can talk to our connect cluster
>> securely (because it’ll need direct access to perform API calls).
>>
>
> The documentation isn't great here, but you can apply all the normal
> security configs to Connect (in distributed mode, it's basically equivalent
> to a consumer, so everything you can do with a consumer you can do with
> Connect).
>
>>
>> Lastly, a question or maybe a piece of feedback… is it not possible to
>> specify the key serializer and deserializer as part of the rest api job
>> config?
>> The issue is that sometimes our data is avro, sometimes it’s json. And it
>> seems I’d need two separate clusters for that?
>>
>
> This is new! As of 0.10.1.0, we have
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-75+-+Add+per-connector+Converters
> which allows you to include it in the connector config. It's called "Converter"
> in Connect because it does a bit more than ser/des if you've written them
> for Kafka, but they are basically just pluggable ser/des. We knew folks
> would want this, it just took us awhile to find the bandwidth to implement
> it. Now, you shouldn't need to do anything special or deploy multiple
> clusters -- it's baked in and supported as long as you are willing to
> override it on a per-connector basis (and this seems reasonable for most
> folks since *ideally* you are *somewhat* standardized on a common
> serialization format).
>
> -Ewen
>
>>
>> On 6 January 2017 at 1:54:10 pm, Ewen Cheslack-Postava (e...@confluent.io)
>> wrote:
>>
>> On Thu, Jan 5, 2017 at 3:12 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:
>>
>> > Hi,
>> >
>> > We like to operate in micro-services (dockerize and ship everything on ecs)
>> > and I was wondering which approach was preferred.
>> > We have one kafka cluster, one zookeeper cluster, etc, but when it comes to
>> > kafka connect I have some doubts.
>> >
>> > Is it better to have one big kafka connect with multiple nodes, or many
>> > small kafka connect clusters or standalone, for each connector / etl ?
>> >
>>
>> You can do any of these, and it may depend on how you do
>> orchestration/deployment.
>>
>> We built Connect to support running one big cluster running a bunch of
>> connectors. It balances work automatically and provides a way to control
>> scale up/down via increased parallelism. This means we don't need to make
>> any assumptions about how you deploy, how you handle elastically scaling
>> your clusters, etc. But if you run in an environment and have the tooling
>> in place to do that already, you can also opt to run many smaller clusters
>> and use that tooling to scale up/down. In that case you'd just make sure
>> there were enough tasks for each connector so that when you scale the # of
>> workers for a cluster up the rebalancing of work would ensure there were
>> enough tasks for every worker to remain occupied.
>>
>> The main drawback of doing this is that Connect uses a few topics for
>> configs, status, and offsets and you need these to be unique per cluster.
>> This means you'll have 3N more topics. If you're running a *lot* of
>> connectors, that could eventually become a problem. It also means you have
>> that many more worker configs to handle, clusters to monitor, etc.
>> And deploying a connector no longer becomes as simple as just making a call to
>> the service's REST API since there isn't a single centralized service. The
>> main benefits I can think of are a) if you already have preferred tooling
>> for handling elasticity and b) better resource isolation between connectors
>> (i.e. an OOM error in one connector won't affect any other connectors).
>>
>> For standalone mode, we'd generally recommend only using it when
>> distributed mode doesn't make sense, e.g. for log file collection. Other
>> than that, having the fault tolerance and high availability of distributed
>> mode is preferred.
>>
>> On your specific points:
>>
>> >
>> > The issues I’m trying to address are :
>> > - Integration with our CI/CD pipeline
>> >
>>
>> I'm not sure anything about Connect affects this. Is there a specific
>> concern you have about the CI/CD pipeline & Connect?
>>
>> > - Efficient resources utilisation
>> >
>>
>> Putting all the connectors into one cluster will probably result in better
>> resource utilization unless you're already automatically tracking usage and
>> scaling appropriately. The reason is that if you use a bunch of small
>> clusters, you're now stuck trying to optimize N uses. Since Connect can
>> already (roughly) balance work, putting all the work into one cluster and
>> having connect split it up means you just need to watch utilization of the
>> nodes in that one cluster and scale up or down as appropriate.
>>
>> > - Easily add new jar files that connectors depend on with minimal downtime
>> >
>>
>> This one is a bit interesting. You shouldn't have any downtime adding jars
>> in the sense that you can do rolling bounces of Connect. The one caveat is
>> that the current limitation for how it rebalances work involves halting
>> work for all connectors/tasks, doing the rebalance, and then starting them
>> up again. We plan to improve this, but the timeframe for it is still
>> uncertain.
>> Usually these rebalance steps should be pretty quick. The main
>> reason this can be a concern is that halting some connectors could take
>> some time (e.g. because they need to fully flush their data). This means
>> the period of time your connectors are not processing data during one of
>> those rebalances is controlled by the "worst" connector.
>>
>> I would recommend trying a single cluster but monitoring whether you see
>> stalls due to rebalances. If you do, then moving to multiple clusters might
>> make sense. (This also, obviously, depends a lot on your SLA for data
>> delivery.)
>>
>> > - Monitoring operations
>> >
>>
>> Multiple clusters definitely seems messier and more complicated for this.
>> There will be more workers in a single cluster, but it's a single service
>> you need to monitor and maintain.
>>
>> Hope that helps!
>>
>> -Ewen
>>
>> >
>> > Thanks for your guidance
>> >
>> > Regards,
>> > Stephane
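
As a rough illustration of the "unique topics per cluster" point in the thread above, here is a small Python sketch that derives distributed-mode worker settings for several separate Connect clusters. The cluster names, the naming scheme, and the broker address are hypothetical; the property keys are the standard distributed worker settings.

```python
def worker_properties(cluster_name, bootstrap_servers):
    """Return distributed-mode worker settings that must be unique per cluster."""
    return {
        "bootstrap.servers": bootstrap_servers,
        # Workers that share a group.id form one Connect cluster.
        "group.id": f"connect-{cluster_name}",
        # The three internal topics each cluster needs for itself.
        "config.storage.topic": f"connect-{cluster_name}-configs",
        "offset.storage.topic": f"connect-{cluster_name}-offsets",
        "status.storage.topic": f"connect-{cluster_name}-status",
    }


def internal_topics(clusters, bootstrap_servers):
    """List every internal topic across all clusters (3 per cluster, so 3N)."""
    topics = []
    for name in clusters:
        props = worker_properties(name, bootstrap_servers)
        topics.extend(
            props[key]
            for key in (
                "config.storage.topic",
                "offset.storage.topic",
                "status.storage.topic",
            )
        )
    return topics


# Hypothetical example: one small cluster per ETL job.
clusters = ["jdbc-etl", "s3-etl", "logs-etl"]
all_topics = internal_topics(clusters, "localhost:9092")
```

With one shared cluster, the same function would be called once and only three internal topics would exist, no matter how many connectors run on it.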