[ https://issues.apache.org/jira/browse/KAFKA-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Gray resolved KAFKA-13335.
-------------------------------
    Resolution: Not A Problem

Finally got back to this after a long time. This is not a bug or fault of Kafka 
Connect. We have a lot of connectors, so it takes a while to rebalance all of 
them, and we were constantly hitting rebalance.timeout.ms, leaving us in an 
endless loop of rebalancing. I am not sure what changed between 2.7.0 and 2.8.0 
to enforce this timeout or to lengthen the time a rebalance takes, but something 
did. Bumping the timeout from 1 minute to 3 minutes fixed it, and we are good to go! 

> Upgrading connect from 2.7.0 to 2.8.0 causes worker instability
> ---------------------------------------------------------------
>
>                 Key: KAFKA-13335
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13335
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.8.0
>            Reporter: John Gray
>            Priority: Major
>         Attachments: image-2021-09-29-09-15-18-172.png
>
>
> After recently upgrading our Connect cluster to 2.8.0 (via 
> Strimzi+Kubernetes, brokers are still on 2.7.0), I am noticing that the 
> cluster is struggling to stabilize. Connectors are being 
> unassigned/reassigned/duplicated continuously and never settling back down. 
> A downgrade back to 2.7.0 fixes things immediately. I have attached a picture 
> of our Grafana dashboards showing some metrics. We have a Connect cluster 
> with 4 nodes, trying to maintain about 1000 connectors, each connector with 
> tasks.max set to 1. 
> We are noticing a slow increase in memory usage, with large random spikes in 
> task counts and thread counts.
> I also notice, over the course of letting 2.8.0 run, a large increase in logs 
> stating {code}ERROR Graceful stop of task (task name here) failed.{code}, but 
> the logs do not seem to indicate a reason. The connectors appear to be stopped 
> only seconds after creation, and only our source connectors seem to be 
> affected. These logs stop after downgrading back to 2.7.0.
> I am also seeing an increase in logs stating that {code}Couldn't instantiate 
> task (task name) because it has an invalid task configuration. This task will 
> not execute until reconfigured. 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder) 
> [StartAndStopExecutor-connect-1-1]
> org.apache.kafka.connect.errors.ConnectException: Task already exists in this worker: (task name)
>       at org.apache.kafka.connect.runtime.Worker.startTask(Worker.java:512)
>       at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:1251)
>       at org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1700(DistributedHerder.java:127)
>       at org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1266)
>       at org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1262)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:834){code}
> I am not sure what could be causing this; any insight would be appreciated! 
> I do notice that Kafka 2.7.1/2.8.0 contain a bugfix related to Connect 
> rebalances (KAFKA-10413). Is that fix potentially causing the instability? 



