[jira] [Commented] (KAFKA-15005) Status of KafkaConnect task not correct

2023-07-24 Thread Sagar Rao (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746774#comment-17746774
 ] 

Sagar Rao commented on KAFKA-15005:
---

[~LucentWong], the PR for https://issues.apache.org/jira/browse/KAFKA-12525 has 
been merged. Closing this one. Thanks!

> Status of KafkaConnect task not correct
> ---
>
> Key: KAFKA-15005
> URL: https://issues.apache.org/jira/browse/KAFKA-15005
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.5.1, 3.0.0, 3.3.2
>Reporter: Yu Wang
>Priority: Major
>
> Our MM2 is running version 2.5.1.
> After a rebalance of our MM2 source tasks, we found there were several tasks 
> always in *UNASSIGNED* status, even the real tasks already started. 
> So we dump the payload of the status topic of Kafka Connect, and found the 
> last two status change is status *RUNNING* followed by status 
> {*}UNASSIGNED{*}.
> {code:java}
> LogAppendTime: 1684167885961 keysize: 64 valuesize: 95 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"RUNNING","trace":null,"worker_id":"worker-1","generation":437643}
> LogAppendTime: 1684167903456 keysize: 64 valuesize: 98 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"UNASSIGNED","trace":null,"worker_id":"worker-2","generation":437643}
>  {code}
> But usually, the RUNNING status should be appended after the UNASSIGNED, 
> because the worker coordinator will revoked the tasks before start new tasks.
> Then we checked the log of our MM2 worker. And found that, during that time, 
> there was a task that revoked on worker-2 and started on worker-1.
>  
> Worker-1
> {code:java}
> [2023-05-15 09:24:45,950] INFO [Worker clientId=connect-1, 
> groupId=__group] Starting task task-7 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> [2023-05-15 09:24:45,951] INFO Creating task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
> Worker-2
> {code:java}
> [2023-05-15 09:24:40,922] INFO Stopping task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
>  
> So I think the incorrect status was caused by the revoked task finished later 
> than the new started task, which made the UNASSIGNED status append to that 
> status topic after the RUNNING status. 
>  
> After reading the code of DistributeHerder, I found that the task revoking is 
> running in a thread pool, the revoke operation just return after submit all 
> the callables. So I think even in the same worker, there is not a guarantee 
> that the revoke operation will always finish before the new tasks start.
> {code:java}
> for (final ConnectorTaskId taskId : tasks) {
> callables.add(getTaskStoppingCallable(taskId));
> }
> // The actual timeout for graceful task/connector stop is applied in worker's
> // stopAndAwaitTask/stopAndAwaitConnector methods.
> startAndStop(callables);
> log.info("Finished stopping tasks in preparation for rebalance"); {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15005) Status of KafkaConnect task not correct

2023-05-21 Thread Sagar Rao (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724615#comment-17724615
 ] 

Sagar Rao commented on KAFKA-15005:
---

Thank you [~LucentWong] , I have updated the condition in the PR to include 
stale UNASSIGNED status from both previous and current generation. 

> Status of KafkaConnect task not correct
> ---
>
> Key: KAFKA-15005
> URL: https://issues.apache.org/jira/browse/KAFKA-15005
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.5.1, 3.0.0, 3.3.2
>Reporter: Yu Wang
>Priority: Major
>
> Our MM2 is running version 2.5.1.
> After a rebalance of our MM2 source tasks, we found there were several tasks 
> always in *UNASSIGNED* status, even the real tasks already started. 
> So we dump the payload of the status topic of Kafka Connect, and found the 
> last two status change is status *RUNNING* followed by status 
> {*}UNASSIGNED{*}.
> {code:java}
> LogAppendTime: 1684167885961 keysize: 64 valuesize: 95 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"RUNNING","trace":null,"worker_id":"worker-1","generation":437643}
> LogAppendTime: 1684167903456 keysize: 64 valuesize: 98 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"UNASSIGNED","trace":null,"worker_id":"worker-2","generation":437643}
>  {code}
> But usually, the RUNNING status should be appended after the UNASSIGNED, 
> because the worker coordinator will revoked the tasks before start new tasks.
> Then we checked the log of our MM2 worker. And found that, during that time, 
> there was a task that revoked on worker-2 and started on worker-1.
>  
> Worker-1
> {code:java}
> [2023-05-15 09:24:45,950] INFO [Worker clientId=connect-1, 
> groupId=__group] Starting task task-7 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> [2023-05-15 09:24:45,951] INFO Creating task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
> Worker-2
> {code:java}
> [2023-05-15 09:24:40,922] INFO Stopping task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
>  
> So I think the incorrect status was caused by the revoked task finished later 
> than the new started task, which made the UNASSIGNED status append to that 
> status topic after the RUNNING status. 
>  
> After reading the code of DistributeHerder, I found that the task revoking is 
> running in a thread pool, the revoke operation just return after submit all 
> the callables. So I think even in the same worker, there is not a guarantee 
> that the revoke operation will always finish before the new tasks start.
> {code:java}
> for (final ConnectorTaskId taskId : tasks) {
> callables.add(getTaskStoppingCallable(taskId));
> }
> // The actual timeout for graceful task/connector stop is applied in worker's
> // stopAndAwaitTask/stopAndAwaitConnector methods.
> startAndStop(callables);
> log.info("Finished stopping tasks in preparation for rebalance"); {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15005) Status of KafkaConnect task not correct

2023-05-20 Thread Yu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724518#comment-17724518
 ] 

Yu Wang commented on KAFKA-15005:
-

[~sagarrao] Sorry for the confusion.

Yes, the RUNNIG status comes from worker-1 and UNASSIGNED status comes from 
worker-2.

> Status of KafkaConnect task not correct
> ---
>
> Key: KAFKA-15005
> URL: https://issues.apache.org/jira/browse/KAFKA-15005
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.5.1, 3.0.0, 3.3.2
>Reporter: Yu Wang
>Priority: Major
>
> Our MM2 is running version 2.5.1.
> After a rebalance of our MM2 source tasks, we found there were several tasks 
> always in *UNASSIGNED* status, even the real tasks already started. 
> So we dump the payload of the status topic of Kafka Connect, and found the 
> last two status change is status *RUNNING* followed by status 
> {*}UNASSIGNED{*}.
> {code:java}
> LogAppendTime: 1684167885961 keysize: 64 valuesize: 95 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"RUNNING","trace":null,"worker_id":"x","generation":437643}
> LogAppendTime: 1684167903456 keysize: 64 valuesize: 98 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"UNASSIGNED","trace":null,"worker_id":"x","generation":437643}
>  {code}
> But usually, the RUNNING status should be appended after the UNASSIGNED, 
> because the worker coordinator will revoked the tasks before start new tasks.
> Then we checked the log of our MM2 worker. And found that, during that time, 
> there was a task that revoked on worker-2 and started on worker-1.
>  
> Worker-1
> {code:java}
> [2023-05-15 09:24:45,950] INFO [Worker clientId=connect-1, 
> groupId=__group] Starting task task-7 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> [2023-05-15 09:24:45,951] INFO Creating task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
> Worker-2
> {code:java}
> [2023-05-15 09:24:40,922] INFO Stopping task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
>  
> So I think the incorrect status was caused by the revoked task finished later 
> than the new started task, which made the UNASSIGNED status append to that 
> status topic after the RUNNING status. 
>  
> After reading the code of DistributeHerder, I found that the task revoking is 
> running in a thread pool, the revoke operation just return after submit all 
> the callables. So I think even in the same worker, there is not a guarantee 
> that the revoke operation will always finish before the new tasks start.
> {code:java}
> for (final ConnectorTaskId taskId : tasks) {
> callables.add(getTaskStoppingCallable(taskId));
> }
> // The actual timeout for graceful task/connector stop is applied in worker's
> // stopAndAwaitTask/stopAndAwaitConnector methods.
> startAndStop(callables);
> log.info("Finished stopping tasks in preparation for rebalance"); {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15005) Status of KafkaConnect task not correct

2023-05-19 Thread Sagar Rao (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724481#comment-17724481
 ] 

Sagar Rao commented on KAFKA-15005:
---

Thanks for the comment [~LucentWong] . Yeah in that PR, I had avoided the 
current generation. One thing I wanted to check with you (since you have 
rightfully masked the worker_ids from the status topic dump), the RUNNING 
message that you see corresponds to worker_1 i.e the worker where the task 
started and the UNASSIGNED status message belongs to worker_2? I am pretty sure 
that is the case but just being double sure. In that case, it should be very 
safe to add that check of generation equality as well.

> Status of KafkaConnect task not correct
> ---
>
> Key: KAFKA-15005
> URL: https://issues.apache.org/jira/browse/KAFKA-15005
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.5.1, 3.0.0, 3.3.2
>Reporter: Yu Wang
>Priority: Major
>
> Our MM2 is running version 2.5.1.
> After a rebalance of our MM2 source tasks, we found there were several tasks 
> always in *UNASSIGNED* status, even the real tasks already started. 
> So we dump the payload of the status topic of Kafka Connect, and found the 
> last two status change is status *RUNNING* followed by status 
> {*}UNASSIGNED{*}.
> {code:java}
> LogAppendTime: 1684167885961 keysize: 64 valuesize: 95 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"RUNNING","trace":null,"worker_id":"x","generation":437643}
> LogAppendTime: 1684167903456 keysize: 64 valuesize: 98 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"UNASSIGNED","trace":null,"worker_id":"x","generation":437643}
>  {code}
> But usually, the RUNNING status should be appended after the UNASSIGNED, 
> because the worker coordinator will revoked the tasks before start new tasks.
> Then we checked the log of our MM2 worker. And found that, during that time, 
> there was a task that revoked on worker-2 and started on worker-1.
>  
> Worker-1
> {code:java}
> [2023-05-15 09:24:45,950] INFO [Worker clientId=connect-1, 
> groupId=__group] Starting task task-7 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> [2023-05-15 09:24:45,951] INFO Creating task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
> Worker-2
> {code:java}
> [2023-05-15 09:24:40,922] INFO Stopping task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
>  
> So I think the incorrect status was caused by the revoked task finished later 
> than the new started task, which made the UNASSIGNED status append to that 
> status topic after the RUNNING status. 
>  
> After reading the code of DistributeHerder, I found that the task revoking is 
> running in a thread pool, the revoke operation just return after submit all 
> the callables. So I think even in the same worker, there is not a guarantee 
> that the revoke operation will always finish before the new tasks start.
> {code:java}
> for (final ConnectorTaskId taskId : tasks) {
> callables.add(getTaskStoppingCallable(taskId));
> }
> // The actual timeout for graceful task/connector stop is applied in worker's
> // stopAndAwaitTask/stopAndAwaitConnector methods.
> startAndStop(callables);
> log.info("Finished stopping tasks in preparation for rebalance"); {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15005) Status of KafkaConnect task not correct

2023-05-18 Thread Yu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724071#comment-17724071
 ] 

Yu Wang commented on KAFKA-15005:
-

[~sagarrao] I check the PR in 
https://issues.apache.org/jira/browse/KAFKA-12525, looks like it tries to fence 
the stale status from previous generation. But from the status topic I dumped, 
the UNASSIGNED status has the same generation with RUNNING status.

> Status of KafkaConnect task not correct
> ---
>
> Key: KAFKA-15005
> URL: https://issues.apache.org/jira/browse/KAFKA-15005
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.5.1, 3.0.0, 3.3.2
>Reporter: Yu Wang
>Priority: Major
>
> Our MM2 is running version 2.5.1.
> After a rebalance of our MM2 source tasks, we found there were several tasks 
> always in *UNASSIGNED* status, even the real tasks already started. 
> So we dump the payload of the status topic of Kafka Connect, and found the 
> last two status change is status *RUNNING* followed by status 
> {*}UNASSIGNED{*}.
> {code:java}
> LogAppendTime: 1684167885961 keysize: 64 valuesize: 95 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"RUNNING","trace":null,"worker_id":"x","generation":437643}
> LogAppendTime: 1684167903456 keysize: 64 valuesize: 98 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"UNASSIGNED","trace":null,"worker_id":"x","generation":437643}
>  {code}
> But usually, the RUNNING status should be appended after the UNASSIGNED, 
> because the worker coordinator will revoked the tasks before start new tasks.
> Then we checked the log of our MM2 worker. And found that, during that time, 
> there was a task that revoked on worker-2 and started on worker-1.
>  
> Worker-1
> {code:java}
> [2023-05-15 09:24:45,950] INFO [Worker clientId=connect-1, 
> groupId=__group] Starting task task-7 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> [2023-05-15 09:24:45,951] INFO Creating task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
> Worker-2
> {code:java}
> [2023-05-15 09:24:40,922] INFO Stopping task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
>  
> So I think the incorrect status was caused by the revoked task finished later 
> than the new started task, which made the UNASSIGNED status append to that 
> status topic after the RUNNING status. 
>  
> After reading the code of DistributeHerder, I found that the task revoking is 
> running in a thread pool, the revoke operation just return after submit all 
> the callables. So I think even in the same worker, there is not a guarantee 
> that the revoke operation will always finish before the new tasks start.
> {code:java}
> for (final ConnectorTaskId taskId : tasks) {
> callables.add(getTaskStoppingCallable(taskId));
> }
> // The actual timeout for graceful task/connector stop is applied in worker's
> // stopAndAwaitTask/stopAndAwaitConnector methods.
> startAndStop(callables);
> log.info("Finished stopping tasks in preparation for rebalance"); {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15005) Status of KafkaConnect task not correct

2023-05-18 Thread Yu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724070#comment-17724070
 ] 

Yu Wang commented on KAFKA-15005:
-

Thanks, [~sagarrao] . Yes, looks like the this ticket is duplicated.

> Status of KafkaConnect task not correct
> ---
>
> Key: KAFKA-15005
> URL: https://issues.apache.org/jira/browse/KAFKA-15005
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.5.1, 3.0.0, 3.3.2
>Reporter: Yu Wang
>Priority: Major
>
> Our MM2 is running version 2.5.1.
> After a rebalance of our MM2 source tasks, we found there were several tasks 
> always in *UNASSIGNED* status, even the real tasks already started. 
> So we dump the payload of the status topic of Kafka Connect, and found the 
> last two status change is status *RUNNING* followed by status 
> {*}UNASSIGNED{*}.
> {code:java}
> LogAppendTime: 1684167885961 keysize: 64 valuesize: 95 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"RUNNING","trace":null,"worker_id":"x","generation":437643}
> LogAppendTime: 1684167903456 keysize: 64 valuesize: 98 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"UNASSIGNED","trace":null,"worker_id":"x","generation":437643}
>  {code}
> But usually, the RUNNING status should be appended after the UNASSIGNED, 
> because the worker coordinator will revoked the tasks before start new tasks.
> Then we checked the log of our MM2 worker. And found that, during that time, 
> there was a task that revoked on worker-2 and started on worker-1.
>  
> Worker-1
> {code:java}
> [2023-05-15 09:24:45,950] INFO [Worker clientId=connect-1, 
> groupId=__group] Starting task task-7 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> [2023-05-15 09:24:45,951] INFO Creating task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
> Worker-2
> {code:java}
> [2023-05-15 09:24:40,922] INFO Stopping task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
>  
> So I think the incorrect status was caused by the revoked task finished later 
> than the new started task, which made the UNASSIGNED status append to that 
> status topic after the RUNNING status. 
>  
> After reading the code of DistributeHerder, I found that the task revoking is 
> running in a thread pool, the revoke operation just return after submit all 
> the callables. So I think even in the same worker, there is not a guarantee 
> that the revoke operation will always finish before the new tasks start.
> {code:java}
> for (final ConnectorTaskId taskId : tasks) {
> callables.add(getTaskStoppingCallable(taskId));
> }
> // The actual timeout for graceful task/connector stop is applied in worker's
> // stopAndAwaitTask/stopAndAwaitConnector methods.
> startAndStop(callables);
> log.info("Finished stopping tasks in preparation for rebalance"); {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15005) Status of KafkaConnect task not correct

2023-05-18 Thread Sagar Rao (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723885#comment-17723885
 ] 

Sagar Rao commented on KAFKA-15005:
---

hey [~LucentWong] thanks for reporting this. This looks very similar to 
https://issues.apache.org/jira/browse/KAFKA-12525 and has been present for a 
while. There's also a PR submitted for that bug. 

> Status of KafkaConnect task not correct
> ---
>
> Key: KAFKA-15005
> URL: https://issues.apache.org/jira/browse/KAFKA-15005
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.5.1, 3.0.0, 3.3.2
>Reporter: Yu Wang
>Priority: Major
>
> Our MM2 is running version 2.5.1.
> After a rebalance of our MM2 source tasks, we found there were several tasks 
> always in *UNASSIGNED* status, even the real tasks already started. 
> So we dump the payload of the status topic of Kafka Connect, and found the 
> last two status change is status *RUNNING* followed by status 
> {*}UNASSIGNED{*}.
> {code:java}
> LogAppendTime: 1684167885961 keysize: 64 valuesize: 95 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"RUNNING","trace":null,"worker_id":"x","generation":437643}
> LogAppendTime: 1684167903456 keysize: 64 valuesize: 98 sequence: -1 
> headerKeys: [] key: task-7 payload: 
> {"state":"UNASSIGNED","trace":null,"worker_id":"x","generation":437643}
>  {code}
> But usually, the RUNNING status should be appended after the UNASSIGNED, 
> because the worker coordinator will revoked the tasks before start new tasks.
> Then we checked the log of our MM2 worker. And found that, during that time, 
> there was a task that revoked on worker-2 and started on worker-1.
>  
> Worker-1
> {code:java}
> [2023-05-15 09:24:45,950] INFO [Worker clientId=connect-1, 
> groupId=__group] Starting task task-7 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> [2023-05-15 09:24:45,951] INFO Creating task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
> Worker-2
> {code:java}
> [2023-05-15 09:24:40,922] INFO Stopping task task-7 
> (org.apache.kafka.connect.runtime.Worker) {code}
>  
> So I think the incorrect status was caused by the revoked task finished later 
> than the new started task, which made the UNASSIGNED status append to that 
> status topic after the RUNNING status. 
>  
> After reading the code of DistributeHerder, I found that the task revoking is 
> running in a thread pool, the revoke operation just return after submit all 
> the callables. So I think even in the same worker, there is not a guarantee 
> that the revoke operation will always finish before the new tasks start.
> {code:java}
> for (final ConnectorTaskId taskId : tasks) {
> callables.add(getTaskStoppingCallable(taskId));
> }
> // The actual timeout for graceful task/connector stop is applied in worker's
> // stopAndAwaitTask/stopAndAwaitConnector methods.
> startAndStop(callables);
> log.info("Finished stopping tasks in preparation for rebalance"); {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)