Chris,

Thanks for the quick reply.  One thing to clarify from the last time we talked 
to you: we resolved last month's incident by following your recommendation 
(increasing batch.size and linger.ms for the config topic's producer), but we 
worry that this issue might come back when we ramp up more KC connectors/tasks 
in the future.  Since it will take quite some time and effort to get the fix 
into Kafka upstream, we want to address this issue proactively by having 
another knob to turn when we are under pressure during an incident.

In our company's environment, there are more than 500 tasks.  500 tasks is the 
count for one topic in that cluster, but we have other topics also running 
through KC in that cluster, and they all share the same config topic.  The 
total is closer to 5000 tasks, and we are planning to increase the parallelism 
further in the near future.

During the incident, the consumer groups kept rebalancing, a new coordinator 
was being elected very frequently, and all tasks' configs had to be written to 
the config topic very frequently because of those rebalances.  Several 
round-trip operations need to finish within that hard-coded 30 seconds 
(configTopic consumer readToEnd, putting the task config into configTopic for 
each task synchronously, readToEnd again, writing the commit message, readToEnd 
again).  The consumer's readToEnd can also take a long time for a newly elected 
coordinator when many messages have accumulated in the config topic due to 
rapid consumer group rebalances (and the compaction thread has not yet had time 
to compact the topic).
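
To make the concern concrete, here is a rough back-of-the-envelope sketch in 
Python of how the sequence above adds up against the 30-second ceiling.  The 
function name and all latency figures are assumptions picked purely for 
illustration, not measurements from our cluster, and num_tasks stands for 
however many task configs end up being written within one timed operation.

```python
TIMEOUT_S = 30.0  # the hard-coded ceiling discussed in this thread


def estimate_config_update_seconds(num_tasks, write_latency_s,
                                   read_to_end_s, commit_write_s):
    """Sum the sequential steps described above (hypothetical latencies).

    readToEnd, one synchronous write per task, readToEnd,
    commit-message write, readToEnd.
    """
    return (read_to_end_s
            + num_tasks * write_latency_s
            + read_to_end_s
            + commit_write_s
            + read_to_end_s)


if __name__ == "__main__":
    # Assumed figures: 5 ms per synchronous write, 1 s per readToEnd.
    for tasks in (500, 5000):
        total = estimate_config_update_seconds(tasks, 0.005, 1.0, 0.005)
        status = "within" if total <= TIMEOUT_S else "over"
        print(f"{tasks} tasks: about {total:.1f}s ({status} the {TIMEOUT_S:.0f}s ceiling)")
```

Because the per-task writes are sequential, the total grows linearly with the 
task count, and a slower readToEnd on a freshly elected coordinator eats 
directly into whatever margin is left.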

We are worried about that hard-coded 30-second ceiling in a large cluster 
during an incident, when lots of consumer rebalances are happening, and we 
would like to have some more headroom above it.

Between the current hard-coded ceiling (30 seconds) and the default 
max.poll.interval.ms (300 seconds) used for unhealthy worker detection, there 
seems to be some room we can tune.
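
As a sketch of what "room to tune" could mean, the Python snippet below 
accepts a timeout only if it sits between the current 30-second ceiling and 
the 300-second default poll interval, minus some headroom.  The function, the 
constants, and the 30-second headroom are all hypothetical; no such setting 
exists in Kafka Connect today as far as this thread is concerned.

```python
CURRENT_CEILING_S = 30   # hard-coded ceiling mentioned above
POLL_INTERVAL_S = 300    # default poll interval mentioned above


def validate_timeout(requested_s, headroom_s=30):
    """Return requested_s if it lies in the tunable window, else raise.

    The window is [CURRENT_CEILING_S, POLL_INTERVAL_S - headroom_s];
    headroom_s is an assumed safety margin, not an existing Kafka setting.
    """
    upper = POLL_INTERVAL_S - headroom_s
    if not CURRENT_CEILING_S <= requested_s <= upper:
        raise ValueError(
            f"timeout {requested_s}s is outside [{CURRENT_CEILING_S}, {upper}]s")
    return requested_s
```

Keeping the upper bound below the poll interval means a slow config write 
would still time out before the worker itself is considered unhealthy.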

As for your suggestion to allow tuning linger.ms only for the config topic and 
not for the status topic, this is already possible.  You can specify 
connect.config.storage.linger.ms to do exactly that.
 
As for the suggestion of setting a higher default linger.ms for the config 
topic, I am not sure that would work for all users.  Some users with small 
clusters might prefer a linger.ms of 0 for lower latency.
