[ 
https://issues.apache.org/jira/browse/KAFKA-9385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010955#comment-17010955
 ] 

Randall Hauch commented on KAFKA-9385:
--------------------------------------

[~kaikai.hou], what version of Apache Kafka Connect are you using? AK 2.4.0 
includes several fixes that avoid split-brain and zombie tasks (see KAFKA-9184), 
and although those fixes have been backported to the {{2.3}} branch, AK 2.3.2 has 
not yet been released. 

If you've been using an AK version prior to 2.4.0, could you try AK 2.4.0 and 
see whether the same problem persists?

If you're already using AK 2.4.0, then this might be an issue that was not fixed 
by KAFKA-9184, and to properly identify and solve the problem we'd need more 
information:
# What is your worker configuration? Ideally you can provide sanitized worker 
config properties files, or, if that's not practical, the log lines from each 
worker process that show the worker config.
# Have you seen INFO-level log messages that include "Broker coordinator was 
unreachable" and/or DEBUG-level log messages that include phrases like "lost 
tasks"?
# Upload a DEBUG log for all workers covering the problematic split-brain window, 
plus some number of lines before and after (see KAFKA-9184 for a similar summary); 
a sketch of how to enable the relevant DEBUG logging follows this list.
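
For item 3, one way to capture the relevant DEBUG output with a stock Kafka Connect 
distribution is to raise the log level of the distributed runtime in 
{{config/connect-log4j.properties}}. How logging is exposed inside the Debezium 
container image may differ, so treat this as a sketch rather than exact instructions:
{code}
# connect-log4j.properties (sketch): DEBUG only for the distributed herder/rebalance
# logic, so the rest of the log stays at INFO
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c:%L)%n
log4j.logger.org.apache.kafka.connect.runtime.distributed=DEBUG
{code}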

Thanks in advance!

> Connect cluster: connector task repeat like a splitbrain cluster problem 
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-9385
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9385
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>            Reporter: kaikai.hou
>            Priority: Major
>         Attachments: 12_31_d8c7j_1.jpg
>
>
> I am using Debezium and found a problem where a task is duplicated. 
> [Jump|https://issues.redhat.com/browse/DBZ-1573?jql=key%20in%20watchedIssues()]
>  
> 1. I pushed the Debezium image to our private image repository.
> 2. I deployed the Connect cluster with the following *Deployment Config*:
> {code:yaml}
> apiVersion: apps.openshift.io/v1
> kind: DeploymentConfig
> metadata:
>   annotations:
>     openshift.io/generated-by: OpenShiftWebConsole
>   creationTimestamp: '2019-10-14T07:45:41Z'
>   generation: 29
>   labels:
>     app: debezium-test-cloud
>   name: debezium-test-cloud
>   namespace: test
>   resourceVersion: '168496156'
>   selfLink: >-
>     /apis/apps.openshift.io/v1/namespaces/test/deploymentconfigs/debezium-test-cloud
>   uid: 9f4f8f4d-ee56-11e9-a5a1-00163e0e008f
> spec:
>   replicas: 2
>   selector:
>     app: debezium-test-cloud
>     deploymentconfig: debezium-test-cloud
>   strategy:
>     activeDeadlineSeconds: 21600
>     resources: {}
>     rollingParams:
>       intervalSeconds: 1
>       maxSurge: 25%
>       maxUnavailable: 25%
>       timeoutSeconds: 600
>       updatePeriodSeconds: 1
>     type: Rolling
>   template:
>     metadata:
>       annotations:
>         openshift.io/generated-by: OpenShiftWebConsole
>       creationTimestamp: null
>       labels:
>         app: debezium-test-cloud
>         deploymentconfig: debezium-test-cloud
>     spec:
>       containers:
>         - env:
>             - name: BOOTSTRAP_SERVERS
>               value: '192.168.100.228:9092'
>             - name: GROUP_ID
>               value: test-cloud
>             - name: CONFIG_STORAGE_TOPIC
>               value: base.test-cloud.config
>             - name: OFFSET_STORAGE_TOPIC
>               value: base.test-cloud.offset
>             - name: STATUS_STORAGE_TOPIC
>               value: base.test-cloud.status
>             - name: CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE
>               value: 'true'
>             - name: CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE
>               value: 'true'
>             - name: CONNECT_PRODUCER_MAX_REQUEST_SIZE
>               value: '20971520'
>             - name: CONNECT_DATABASE_HISTORY_KAFKA_RECOVERY_POLL_INTERVAL_MS
>               value: '1000'
>             - name: HEAP_OPTS
>               value: '-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0'
>           image: 'registry.cn-hangzhou.aliyuncs.com/eshine/debeziumconnect:1.0.0.Beta2'
>           imagePullPolicy: IfNotPresent
>           name: debezium-test-cloud
>           ports:
>             - containerPort: 8083
>               protocol: TCP
>             - containerPort: 8778
>               protocol: TCP
>             - containerPort: 9092
>               protocol: TCP
>             - containerPort: 9779
>               protocol: TCP
>           resources:
>             limits:
>               cpu: 400m
>               memory: 1Gi
>             requests:
>               cpu: 200m
>               memory: 1Gi
>           terminationMessagePath: /dev/termination-log
>           terminationMessagePolicy: File
>           volumeMounts:
>             - mountPath: /kafka/config
>               name: debezium-test-cloud-1
>             - mountPath: /kafka/data
>               name: debezium-test-cloud-2
>             - mountPath: /kafka/logs
>               name: debezium-test-cloud-3
>       dnsPolicy: ClusterFirst
>       restartPolicy: Always
>       schedulerName: default-scheduler
>       securityContext: {}
>       terminationGracePeriodSeconds: 30
>       volumes:
>         - emptyDir: {}
>           name: debezium-test-cloud-1
>         - emptyDir: {}
>           name: debezium-test-cloud-2
>         - emptyDir: {}
>           name: debezium-test-cloud-3
>   test: false
>   triggers:
>     - type: ConfigChange
> status:
>   availableReplicas: 2
>   conditions:
>     - lastTransitionTime: '2019-11-25T06:44:30Z'
>       lastUpdateTime: '2019-11-25T06:44:44Z'
>       message: replication controller "debezium-test-cloud-15" successfully rolled out
>       reason: NewReplicationControllerAvailable
>       status: 'True'
>       type: Progressing
>     - lastTransitionTime: '2019-12-31T10:06:23Z'
>       lastUpdateTime: '2019-12-31T10:06:23Z'
>       message: Deployment config has minimum availability.
>       status: 'True'
>       type: Available
>   details:
>     causes:
>       - type: Manual
>     message: manual change
>   latestVersion: 15
>   observedGeneration: 29
>   readyReplicas: 2
>   replicas: 2
>   unavailableReplicas: 0
>   updatedReplicas: 2
> {code}
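> (For reference: assuming the Debezium connect image maps these environment variables to Kafka Connect worker properties in its usual way, i.e. GROUP_ID -> group.id, the *_STORAGE_TOPIC variables -> the corresponding *.storage.topic properties, and CONNECT_* -> the matching property with the prefix dropped, lowercased, and underscores replaced by dots, the deployment above should correspond roughly to the worker configuration below. This is inferred from the env vars, not captured from a running worker.)
> {code}
> # Inferred worker properties (sketch derived from the env vars above)
> bootstrap.servers=192.168.100.228:9092
> group.id=test-cloud
> config.storage.topic=base.test-cloud.config
> offset.storage.topic=base.test-cloud.offset
> status.storage.topic=base.test-cloud.status
> key.converter.schemas.enable=true
> value.converter.schemas.enable=true
> producer.max.request.size=20971520
> # database.history.* is a Debezium connector-level setting and would normally be
> # part of the connector config rather than the worker config
> database.history.kafka.recovery.poll.interval.ms=1000
> {code}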
> 3. The Connect cluster runs in OpenShift: one service with two pods.
> 4. Observed sequence:
>      a) task_connector_1_0 and task_connector_3_0 were running in PodA; 
> task_connector_2_0 was running in PodB.
>      b) Then the PodA console showed the following error log (see attachment 
> "12_31_d8c7j_1.jpg"):
>         !12_31_d8c7j_1.jpg!
>      c) Then a rebalance started.
>      d) However, after the rebalance PodB was running all three tasks 
> (task_connector_1_0, task_connector_2_0, task_connector_3_0), while PodA was 
> still running task_connector_1_0 and task_connector_3_0.
>      e) So the duplicate tasks appeared.
>  
>     



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
