[jira] [Commented] (KAFKA-13459) MM2 should be able to add the source offset to the record header

2024-04-22 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839717#comment-17839717
 ] 

Viktor Somogyi-Vass commented on KAFKA-13459:
-

[~durban] have you planned to move ahead with this? Does this require a KIP?

> MM2 should be able to add the source offset to the record header
> 
>
> Key: KAFKA-13459
> URL: https://issues.apache.org/jira/browse/KAFKA-13459
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Reporter: Daniel Urban
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> MM2 could add the source offset to the record header to help with diagnostics 
> in some use-cases.
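
As a rough illustration of the idea only (not MM2's actual implementation; the class name and header key below are invented), a Connect transformation applied to the MirrorMaker source connector could copy the record's source offset into a header:

{code:java}
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.transforms.Transformation;

// Hypothetical SMT, for illustration: stamps the source offset into a record header.
public class SourceOffsetHeader implements Transformation<SourceRecord> {

    @Override
    public SourceRecord apply(SourceRecord record) {
        // sourceOffset() holds the upstream offset tracked by the source task;
        // stringify it and attach it as a header for downstream diagnostics.
        if (record.sourceOffset() != null) {
            record.headers().addString("mm2.source.offset", String.valueOf(record.sourceOffset()));
        }
        return record;
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // no configuration needed for this sketch
    }

    @Override
    public void close() {
        // nothing to clean up
    }
}
{code}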



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16596) Flaky test – org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()

2024-04-22 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-16596:

Fix Version/s: 3.6.3

> Flaky test – 
> org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
>  
> ---
>
> Key: KAFKA-16596
> URL: https://issues.apache.org/jira/browse/KAFKA-16596
> Project: Kafka
>  Issue Type: Test
>Reporter: Igor Soarez
>Assignee: Andras Katona
>Priority: Major
>  Labels: GoodForNewContributors, good-first-issue
> Fix For: 3.8.0, 3.7.1, 3.6.3
>
>
> org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
>  failed in the following way:
>  
> {code:java}
> org.opentest4j.AssertionFailedError: Unexpected addresses [93.184.215.14, 
> 2606:2800:21f:cb07:6820:80da:af6b:8b2c] ==> expected: <true> but was: <false>
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)   
>  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) 
> at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)  
>   at 
> app//org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup(ClientUtilsTest.java:65)
>  {code}
> As a result of the following assertions:
>  
> {code:java}
> // With lookup of example.com, either one or two addresses are expected 
> // depending on whether ipv4 and ipv6 are enabled
> List<InetSocketAddress> validatedAddresses = 
> checkWithLookup(asList("example.com:1"));
> assertTrue(validatedAddresses.size() >= 1, "Unexpected addresses " + 
> validatedAddresses);
> List<String> validatedHostNames = 
> validatedAddresses.stream().map(InetSocketAddress::getHostName)
> .collect(Collectors.toList());
> List<String> expectedHostNames = asList("93.184.216.34", 
> "2606:2800:220:1:248:1893:25c8:1946"); {code}
> It seems that the DNS result has changed for example.com.
>  
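
One way to make the assertion independent of example.com's current A/AAAA records (a sketch only, assuming the test's existing helpers plus imports for java.net.InetAddress, java.util.Arrays and java.util.Set, and a test method that declares the checked UnknownHostException) is to derive the expectation from a lookup done at test time:

{code:java}
// Sketch: compare against a live lookup instead of pinned IP literals.
List<InetSocketAddress> validatedAddresses = checkWithLookup(asList("example.com:1"));
assertTrue(validatedAddresses.size() >= 1, "Unexpected addresses " + validatedAddresses);

// Resolve example.com at test time so the expectation tracks DNS changes.
Set<String> resolvedIps = Arrays.stream(InetAddress.getAllByName("example.com"))
        .map(InetAddress::getHostAddress)
        .collect(Collectors.toSet());

for (InetSocketAddress address : validatedAddresses) {
    assertTrue(resolvedIps.contains(address.getAddress().getHostAddress()),
            "Address " + address + " not among resolved IPs " + resolvedIps);
}
{code}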



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16596) Flaky test – org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()

2024-04-22 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839636#comment-17839636
 ] 

Viktor Somogyi-Vass commented on KAFKA-16596:
-

Cherry-picked this onto 3.7 and 3.6 too.

> Flaky test – 
> org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
>  
> ---
>
> Key: KAFKA-16596
> URL: https://issues.apache.org/jira/browse/KAFKA-16596
> Project: Kafka
>  Issue Type: Test
>Reporter: Igor Soarez
>Assignee: Andras Katona
>Priority: Major
>  Labels: GoodForNewContributors, good-first-issue
> Fix For: 3.8.0, 3.7.1, 3.6.3
>
>
> org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
>  failed in the following way:
>  
> {code:java}
> org.opentest4j.AssertionFailedError: Unexpected addresses [93.184.215.14, 
> 2606:2800:21f:cb07:6820:80da:af6b:8b2c] ==> expected: <true> but was: <false>
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)   
>  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) 
> at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)  
>   at 
> app//org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup(ClientUtilsTest.java:65)
>  {code}
> As a result of the following assertions:
>  
> {code:java}
> // With lookup of example.com, either one or two addresses are expected 
> // depending on whether ipv4 and ipv6 are enabled
> List<InetSocketAddress> validatedAddresses = 
> checkWithLookup(asList("example.com:1"));
> assertTrue(validatedAddresses.size() >= 1, "Unexpected addresses " + 
> validatedAddresses);
> List<String> validatedHostNames = 
> validatedAddresses.stream().map(InetSocketAddress::getHostName)
> .collect(Collectors.toList());
> List<String> expectedHostNames = asList("93.184.216.34", 
> "2606:2800:220:1:248:1893:25c8:1946"); {code}
> It seems that the DNS result has changed for example.com.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16596) Flaky test – org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()

2024-04-22 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-16596:

Fix Version/s: 3.7.1

> Flaky test – 
> org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
>  
> ---
>
> Key: KAFKA-16596
> URL: https://issues.apache.org/jira/browse/KAFKA-16596
> Project: Kafka
>  Issue Type: Test
>Reporter: Igor Soarez
>Assignee: Andras Katona
>Priority: Major
>  Labels: GoodForNewContributors, good-first-issue
> Fix For: 3.8.0, 3.7.1
>
>
> org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
>  failed in the following way:
>  
> {code:java}
> org.opentest4j.AssertionFailedError: Unexpected addresses [93.184.215.14, 
> 2606:2800:21f:cb07:6820:80da:af6b:8b2c] ==> expected: <true> but was: <false>
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)   
>  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) 
> at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)  
>   at 
> app//org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup(ClientUtilsTest.java:65)
>  {code}
> As a result of the following assertions:
>  
> {code:java}
> // With lookup of example.com, either one or two addresses are expected 
> // depending on whether ipv4 and ipv6 are enabled
> List<InetSocketAddress> validatedAddresses = 
> checkWithLookup(asList("example.com:1"));
> assertTrue(validatedAddresses.size() >= 1, "Unexpected addresses " + 
> validatedAddresses);
> List<String> validatedHostNames = 
> validatedAddresses.stream().map(InetSocketAddress::getHostName)
> .collect(Collectors.toList());
> List<String> expectedHostNames = asList("93.184.216.34", 
> "2606:2800:220:1:248:1893:25c8:1946"); {code}
> It seems that the DNS result has changed for example.com.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16596) Flaky test – org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()

2024-04-22 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass resolved KAFKA-16596.
-
Fix Version/s: 3.8.0
 Assignee: Andras Katona
   Resolution: Fixed

> Flaky test – 
> org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
>  
> ---
>
> Key: KAFKA-16596
> URL: https://issues.apache.org/jira/browse/KAFKA-16596
> Project: Kafka
>  Issue Type: Test
>Reporter: Igor Soarez
>Assignee: Andras Katona
>Priority: Major
>  Labels: GoodForNewContributors, good-first-issue
> Fix For: 3.8.0
>
>
> org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
>  failed in the following way:
>  
> {code:java}
> org.opentest4j.AssertionFailedError: Unexpected addresses [93.184.215.14, 
> 2606:2800:21f:cb07:6820:80da:af6b:8b2c] ==> expected: <true> but was: <false>
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)   
>  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) 
> at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)  
>   at 
> app//org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup(ClientUtilsTest.java:65)
>  {code}
> As a result of the following assertions:
>  
> {code:java}
> // With lookup of example.com, either one or two addresses are expected 
> // depending on whether ipv4 and ipv6 are enabled
> List<InetSocketAddress> validatedAddresses = 
> checkWithLookup(asList("example.com:1"));
> assertTrue(validatedAddresses.size() >= 1, "Unexpected addresses " + 
> validatedAddresses);
> List<String> validatedHostNames = 
> validatedAddresses.stream().map(InetSocketAddress::getHostName)
> .collect(Collectors.toList());
> List<String> expectedHostNames = asList("93.184.216.34", 
> "2606:2800:220:1:248:1893:25c8:1946"); {code}
> It seems that the DNS result has changed for example.com.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16596) Flaky test – org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()

2024-04-22 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839624#comment-17839624
 ] 

Viktor Somogyi-Vass commented on KAFKA-16596:
-

[~soarez] just merged the PR a couple of minutes ago and I didn't notice this 
issue, sorry for that. I'll resolve this ticket.

> Flaky test – 
> org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
>  
> ---
>
> Key: KAFKA-16596
> URL: https://issues.apache.org/jira/browse/KAFKA-16596
> Project: Kafka
>  Issue Type: Test
>Reporter: Igor Soarez
>Priority: Major
>  Labels: GoodForNewContributors, good-first-issue
>
> org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
>  failed in the following way:
>  
> {code:java}
> org.opentest4j.AssertionFailedError: Unexpected addresses [93.184.215.14, 
> 2606:2800:21f:cb07:6820:80da:af6b:8b2c] ==> expected: <true> but was: <false>
>   at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)   
>  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) 
> at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)  
>   at 
> app//org.apache.kafka.clients.ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup(ClientUtilsTest.java:65)
>  {code}
> As a result of the following assertions:
>  
> {code:java}
> // With lookup of example.com, either one or two addresses are expected 
> // depending on whether ipv4 and ipv6 are enabled
> List<InetSocketAddress> validatedAddresses = 
> checkWithLookup(asList("example.com:1"));
> assertTrue(validatedAddresses.size() >= 1, "Unexpected addresses " + 
> validatedAddresses);
> List<String> validatedHostNames = 
> validatedAddresses.stream().map(InetSocketAddress::getHostName)
> .collect(Collectors.toList());
> List<String> expectedHostNames = asList("93.184.216.34", 
> "2606:2800:220:1:248:1893:25c8:1946"); {code}
> It seems that the DNS result has changed for example.com.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-15649) Handle directory failure timeout

2024-04-11 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836080#comment-17836080
 ] 

Viktor Somogyi-Vass edited comment on KAFKA-15649 at 4/11/24 9:37 AM:
--

[~soarez] I've uploaded a PR to start the review process early. I still would 
like to do some manual testing too but would be happy to receive some feedback 
once you get some time for this.


was (Author: viktorsomogyi):
[~soarez] I've uploaded a PR to start the review process early. I still would 
like to do some manual testing too but would be happy to receive some feedback.

> Handle directory failure timeout 
> -
>
> Key: KAFKA-15649
> URL: https://issues.apache.org/jira/browse/KAFKA-15649
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Igor Soarez
>Assignee: Viktor Somogyi-Vass
>Priority: Minor
>
> If a broker with an offline log directory continues to fail to notify the 
> controller of either:
>  * the fact that the directory is offline; or
>  * of any replica assignment into a failed directory
> then the controller will not check if a leadership change is required, and 
> this may lead to partitions remaining indefinitely offline.
> KIP-858 proposes that the broker should shut down after a configurable 
> timeout to force a leadership change. Alternatively, the broker could also 
> request to be fenced, as long as there's a path for it to later become 
> unfenced.
> While this unavailability is possible in theory, in practice it's not easy to 
> entertain a scenario where a broker continues to appear as healthy before the 
> controller, but fails to send this information. So it's not clear if this is 
> a real problem. 
>  
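
A minimal sketch of the timeout idea from KIP-858 (class name, method names, and the mentioned config key are assumptions for illustration, not the actual broker code):

{code:java}
// Hypothetical helper: force a shutdown if an offline log directory is never
// acknowledged by the controller within a configurable timeout.
public final class DirectoryFailureTimer {
    private final long timeoutMs;        // e.g. a new config such as log.dir.failure.timeout.ms (assumed name)
    private final Runnable shutdownHook; // action to take when the controller never acknowledges the failure
    private volatile long failurePropagationStartMs = -1L;

    public DirectoryFailureTimer(long timeoutMs, Runnable shutdownHook) {
        this.timeoutMs = timeoutMs;
        this.shutdownHook = shutdownHook;
    }

    // Called when a log directory goes offline and the broker starts notifying the controller.
    public void directoryFailed(long nowMs) {
        if (failurePropagationStartMs < 0) failurePropagationStartMs = nowMs;
    }

    // Called when the controller acknowledges the offline directory (e.g. via a heartbeat response).
    public void failureAcknowledged() {
        failurePropagationStartMs = -1L;
    }

    // Polled periodically; triggers shutdown (or fencing) once the timeout elapses without acknowledgement.
    public void maybeTimeout(long nowMs) {
        long start = failurePropagationStartMs;
        if (start >= 0 && nowMs - start >= timeoutMs) {
            shutdownHook.run();
        }
    }
}
{code}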



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-15649) Handle directory failure timeout

2024-03-11 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-15649:
---

Assignee: Viktor Somogyi-Vass

> Handle directory failure timeout 
> -
>
> Key: KAFKA-15649
> URL: https://issues.apache.org/jira/browse/KAFKA-15649
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Igor Soarez
>Assignee: Viktor Somogyi-Vass
>Priority: Minor
>
> If a broker with an offline log directory continues to fail to notify the 
> controller of either:
>  * the fact that the directory is offline; or
>  * of any replica assignment into a failed directory
> then the controller will not check if a leadership change is required, and 
> this may lead to partitions remaining indefinitely offline.
> KIP-858 proposes that the broker should shut down after a configurable 
> timeout to force a leadership change. Alternatively, the broker could also 
> request to be fenced, as long as there's a path for it to later become 
> unfenced.
> While this unavailability is possible in theory, in practice it's not easy to 
> entertain a scenario where a broker continues to appear as healthy before the 
> controller, but fails to send this information. So it's not clear if this is 
> a real problem. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15649) Handle directory failure timeout

2024-03-08 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824815#comment-17824815
 ] 

Viktor Somogyi-Vass commented on KAFKA-15649:
-

[~soarez] would you mind if I picked this up?

> Handle directory failure timeout 
> -
>
> Key: KAFKA-15649
> URL: https://issues.apache.org/jira/browse/KAFKA-15649
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Igor Soarez
>Priority: Minor
>
> If a broker with an offline log directory continues to fail to notify the 
> controller of either:
>  * the fact that the directory is offline; or
>  * of any replica assignment into a failed directory
> then the controller will not check if a leadership change is required, and 
> this may lead to partitions remaining indefinitely offline.
> KIP-858 proposes that the broker should shut down after a configurable 
> timeout to force a leadership change. Alternatively, the broker could also 
> request to be fenced, as long as there's a path for it to later become 
> unfenced.
> While this unavailability is possible in theory, in practice it's not easy to 
> entertain a scenario where a broker continues to appear as healthy before the 
> controller, but fails to send this information. So it's not clear if this is 
> a real problem. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-13949) Connect /connectors endpoint should support querying the active topics and the task configs

2024-02-05 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814311#comment-17814311
 ] 

Viktor Somogyi-Vass commented on KAFKA-13949:
-

[~wasniksudesh] feel free to ping me when your PR is ready, I'm happy to review 
it.

> Connect /connectors endpoint should support querying the active topics and 
> the task configs
> ---
>
> Key: KAFKA-13949
> URL: https://issues.apache.org/jira/browse/KAFKA-13949
> Project: Kafka
>  Issue Type: Improvement
>  Components: connect
>Affects Versions: 3.2.0
>Reporter: Viktor Somogyi-Vass
>Assignee: Sudesh Wasnik
>Priority: Major
>
> The /connectors endpoint supports the "expand" query parameter, which acts as 
> a set of queried categories, currently supporting info (config) and status 
> (monitoring status).
> The endpoint should also support adding the active topics of a connector, and 
> adding the separate task configs, too.
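
For reference, this is how the existing expand categories can be queried today; a sketch assuming a Connect worker on localhost:8083, and note the proposed categories for active topics and task configs do not exist yet:

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectorsExpandExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // "info" and "status" are the categories supported today; the ticket proposes additional ones.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors?expand=info&expand=status"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
{code}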



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15366) Log directory failure integration test

2023-12-13 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796284#comment-17796284
 ] 

Viktor Somogyi-Vass commented on KAFKA-15366:
-

[~soarez] I think I'll separate testing the existing functionality and the 
other tests you suggested into multiple PRs so we have some integration test 
coverage as soon as possible given the looming code freeze for 3.7 (20th).

> Log directory failure integration test
> --
>
> Key: KAFKA-15366
> URL: https://issues.apache.org/jira/browse/KAFKA-15366
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Igor Soarez
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15992) Make MM2 heartbeats topic name configurable

2023-12-11 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-15992:

Component/s: mirrormaker

> Make MM2 heartbeats topic name configurable
> ---
>
> Key: KAFKA-15992
> URL: https://issues.apache.org/jira/browse/KAFKA-15992
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Affects Versions: 3.7.0
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> With DefaultReplicationPolicy, the heartbeats topic name is hard-coded. 
> Instead, this should be configurable, so users can avoid collisions with the 
> "heartbeats" topics of other systems.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15992) Make MM2 heartbeats topic name configurable

2023-12-11 Thread Viktor Somogyi-Vass (Jira)
Viktor Somogyi-Vass created KAFKA-15992:
---

 Summary: Make MM2 heartbeats topic name configurable
 Key: KAFKA-15992
 URL: https://issues.apache.org/jira/browse/KAFKA-15992
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 3.7.0
Reporter: Viktor Somogyi-Vass
Assignee: Viktor Somogyi-Vass


With DefaultReplicationPolicy, the heartbeats topic name is hard-coded. 
Instead, this should be configurable, so users can avoid collisions with the 
"heartbeats" topics of other systems.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-13922) Unable to generate coverage reports for the whole project

2023-12-11 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-13922:

Labels: cloudera  (was: )

> Unable to generate coverage reports for the whole project
> -
>
> Key: KAFKA-13922
> URL: https://issues.apache.org/jira/browse/KAFKA-13922
> Project: Kafka
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.5.0
>Reporter: Patrik Nagy
>Assignee: Andras Katona
>Priority: Minor
>  Labels: cloudera
>
> It is documented in the project that if we need code coverage reports for the 
> whole project, we need to run something like this where we enabled the test 
> coverage flag and run the reportCoverage task:
> {code:java}
> ./gradlew reportCoverage -PenableTestCoverage=true -Dorg.gradle.parallel=false
> {code}
> If I run it, the build will fail in the end because of jacocoRootReport:
> {code:java}
> 14:34:41 > Task :jacocoRootReport FAILED
> 14:34:41 
> 14:34:41 FAILURE: Build failed with an exception.
> 14:34:41 
> 14:34:41 * What went wrong:
> 14:34:41 Some problems were found with the configuration of task 
> ':jacocoRootReport' (type 'JacocoReport').
> 14:34:41   - Type 'org.gradle.testing.jacoco.tasks.JacocoReport' property 
> 'jacocoClasspath' doesn't have a configured value.
> 14:34:41 
> 14:34:41 Reason: This property isn't marked as optional and no value has 
> been configured.
> 14:34:41 
> 14:34:41 Possible solutions:
> 14:34:41   1. Assign a value to 'jacocoClasspath'.
> 14:34:41   2. Mark property 'jacocoClasspath' as optional.
> 14:34:41 
> 14:34:41 Please refer to 
> https://docs.gradle.org/7.3.3/userguide/validation_problems.html#value_not_set
>  for more details about this problem.
> 14:34:41   - Type 'org.gradle.testing.jacoco.tasks.JacocoReport' property 
> 'reports.enabledReports.html.outputLocation' doesn't have a configured value.
> 14:34:41 
> 14:34:41 Reason: This property isn't marked as optional and no value has 
> been configured.
> 14:34:41 
> 14:34:41 Possible solutions:
> 14:34:41   1. Assign a value to 
> 'reports.enabledReports.html.outputLocation'.
> 14:34:41   2. Mark property 'reports.enabledReports.html.outputLocation' 
> as optional.
> 14:34:41 
> 14:34:41 Please refer to 
> https://docs.gradle.org/7.3.3/userguide/validation_problems.html#value_not_set
>  for more details about this problem.
> 14:34:41   - Type 'org.gradle.testing.jacoco.tasks.JacocoReport' property 
> 'reports.enabledReports.xml.outputLocation' doesn't have a configured value.
> 14:34:41 
> 14:34:41 Reason: This property isn't marked as optional and no value has 
> been configured.
> 14:34:41 
> 14:34:41 Possible solutions:
> 14:34:41   1. Assign a value to 
> 'reports.enabledReports.xml.outputLocation'.
> 14:34:41   2. Mark property 'reports.enabledReports.xml.outputLocation' 
> as optional. {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-15366) Log directory failure integration test

2023-12-05 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793324#comment-17793324
 ] 

Viktor Somogyi-Vass edited comment on KAFKA-15366 at 12/5/23 3:20 PM:
--

[~soarez] did you have anything in mind other than modifying 
{{kafka.server.LogDirFailureTest}} to run on KRaft clusters too? Although it's 
not named as an integration test, it effectively is one, since it subclasses 
{{IntegrationTestHarness}} and covers some cases; it seems we just need to 
parameterize it to run on KRaft clusters.
Are there any additional test cases besides those that we should think about?


was (Author: viktorsomogyi):
[~soarez] did you have anything in mind other than modifying 
{{kafka.server.LogDirFailureTest}} to run on KRaft clusters too? Are there any 
additional test cases besides those that we should think about?

> Log directory failure integration test
> --
>
> Key: KAFKA-15366
> URL: https://issues.apache.org/jira/browse/KAFKA-15366
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Igor Soarez
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15366) Log directory failure integration test

2023-12-05 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793324#comment-17793324
 ] 

Viktor Somogyi-Vass commented on KAFKA-15366:
-

[~soarez] did you have anything in mind other than modifying 
{{kafka.server.LogDirFailureTest}} to run on KRaft clusters too? Are there any 
additional test cases besides those that we should think about?

> Log directory failure integration test
> --
>
> Key: KAFKA-15366
> URL: https://issues.apache.org/jira/browse/KAFKA-15366
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Igor Soarez
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14127) KIP-858: Handle JBOD broker disk failure in KRaft

2023-12-05 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793321#comment-17793321
 ] 

Viktor Somogyi-Vass commented on KAFKA-14127:
-

I'll try one of the tests and perhaps review your changes to get some more 
context.

> KIP-858: Handle JBOD broker disk failure in KRaft
> -
>
> Key: KAFKA-14127
> URL: https://issues.apache.org/jira/browse/KAFKA-14127
> Project: Kafka
>  Issue Type: Improvement
>  Components: jbod, kraft
>Reporter: Igor Soarez
>Assignee: Igor Soarez
>Priority: Major
>  Labels: 4.0-blocker, kip-500, kraft
> Fix For: 3.7.0
>
>
> Supporting configurations with multiple storage directories in KRaft mode



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-15366) Log directory failure integration test

2023-12-05 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-15366:
---

Assignee: Viktor Somogyi-Vass

> Log directory failure integration test
> --
>
> Key: KAFKA-15366
> URL: https://issues.apache.org/jira/browse/KAFKA-15366
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Igor Soarez
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-14127) KIP-858: Handle JBOD broker disk failure in KRaft

2023-12-04 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792858#comment-17792858
 ] 

Viktor Somogyi-Vass edited comment on KAFKA-14127 at 12/4/23 1:50 PM:
--

[~soarez] is there anything that I can pick up from these that doesn't depend 
on other in progress tickets? I'd like to help to speed up the completion of 
this.


was (Author: viktorsomogyi):
[~soarez] is there anything that I can pick up from these that doesn't depend 
on other tickets? I'd like to help to speed up the completion of this.

> KIP-858: Handle JBOD broker disk failure in KRaft
> -
>
> Key: KAFKA-14127
> URL: https://issues.apache.org/jira/browse/KAFKA-14127
> Project: Kafka
>  Issue Type: Improvement
>  Components: jbod, kraft
>Reporter: Igor Soarez
>Assignee: Igor Soarez
>Priority: Major
>  Labels: 4.0-blocker, kip-500, kraft
> Fix For: 3.7.0
>
>
> Supporting configurations with multiple storage directories in KRaft mode



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14127) KIP-858: Handle JBOD broker disk failure in KRaft

2023-12-04 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792858#comment-17792858
 ] 

Viktor Somogyi-Vass commented on KAFKA-14127:
-

[~soarez] is there anything that I can pick up from these that doesn't depend 
on other tickets? I'd like to help to speed up the completion of this.

> KIP-858: Handle JBOD broker disk failure in KRaft
> -
>
> Key: KAFKA-14127
> URL: https://issues.apache.org/jira/browse/KAFKA-14127
> Project: Kafka
>  Issue Type: Improvement
>  Components: jbod, kraft
>Reporter: Igor Soarez
>Assignee: Igor Soarez
>Priority: Major
>  Labels: 4.0-blocker, kip-500, kraft
> Fix For: 3.7.0
>
>
> Supporting configurations with multiple storage directories in KRaft mode



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-8128) Dynamic JAAS change for clients

2023-11-30 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-8128:
---
Labels: admin clients consumer need-kip producer  (was: )

> Dynamic JAAS change for clients
> ---
>
> Key: KAFKA-8128
> URL: https://issues.apache.org/jira/browse/KAFKA-8128
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 2.2.0
>Reporter: Gabor Somogyi
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>  Labels: admin, clients, consumer, need-kip, producer
>
> Clients using JAAS based authentication are now forced to restart themselves 
> in order to reload the JAAS configuration. We could
> - make the {{sasl.jaas.config}} dynamically configurable and therefore better 
> equip them to changing tokens etc.
> - detect file system level changes in configured JAAS and reload the context.
> Original issue:
> Re-authentication feature on broker side is under implementation which will 
> enforce consumer/producer instances to re-authenticate time to time. It would 
> be good to set the latest delegation token dynamically and not re-creating 
> consumer/producer instances.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-8128) Dynamic JAAS change for clients

2023-11-30 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791679#comment-17791679
 ] 

Viktor Somogyi-Vass commented on KAFKA-8128:


Have a KIP ready but will also do a PoC implementation to make sure there will 
be no unexpected changes that would affect the KIP.

> Dynamic JAAS change for clients
> ---
>
> Key: KAFKA-8128
> URL: https://issues.apache.org/jira/browse/KAFKA-8128
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 2.2.0
>Reporter: Gabor Somogyi
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> Clients using JAAS based authentication are now forced to restart themselves 
> in order to reload the JAAS configuration. We could
> - make the {{sasl.jaas.config}} dynamically configurable and therefore better 
> equip them to changing tokens etc.
> - detect file system level changes in configured JAAS and reload the context.
> Original issue:
> Re-authentication feature on broker side is under implementation which will 
> enforce consumer/producer instances to re-authenticate time to time. It would 
> be good to set the latest delegation token dynamically and not re-creating 
> consumer/producer instances.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15362) Resolve offline replicas in metadata cache

2023-11-27 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790152#comment-17790152
 ] 

Viktor Somogyi-Vass commented on KAFKA-15362:
-

[~soarez] do you think this ticket can be resolved or do you expect more PRs?

> Resolve offline replicas in metadata cache 
> ---
>
> Key: KAFKA-15362
> URL: https://issues.apache.org/jira/browse/KAFKA-15362
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Igor Soarez
>Assignee: Igor Soarez
>Priority: Major
>
> Considering broker's offline log directories and replica to dir assignments



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-8820) kafka-reassign-partitions.sh should support the KIP-455 API

2023-10-17 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-8820:
---
Fix Version/s: 2.6.0
   (was: 2.5.0)

> kafka-reassign-partitions.sh should support the KIP-455 API
> ---
>
> Key: KAFKA-8820
> URL: https://issues.apache.org/jira/browse/KAFKA-8820
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Gwen Shapira
>Assignee: Colin McCabe
>Priority: Major
>  Labels: kip-500
> Fix For: 2.6.0
>
>
> KIP-455 and KAFKA-8345 add a protocol and AdminAPI that will be used for 
> replica reassignments. We need to update the reassignment tool to use this 
> new API rather than work with ZK directly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-8820) kafka-reassign-partitions.sh should support the KIP-455 API

2023-10-17 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776151#comment-17776151
 ] 

Viktor Somogyi-Vass commented on KAFKA-8820:


Seems like the fix version is incorrect; it should be 2.6. I'll change it.
{noformat}
git log --oneline | grep KAFKA-8820
56051e7639 KAFKA-8820: kafka-reassign-partitions.sh should support the KIP-455 
API (#8244)
{noformat}

{noformat}
git branch -a --contains 56051e7639 | grep "upstream/[23]\.[0-9]" | cut -d'.' 
-f1-3 | sort -u
  remotes/upstream/2.6
  remotes/upstream/2.7
  remotes/upstream/2.8
  remotes/upstream/3.0
  remotes/upstream/3.1
  remotes/upstream/3.2
  remotes/upstream/3.3
  remotes/upstream/3.4
  remotes/upstream/3.5
  remotes/upstream/3.6
{noformat}

> kafka-reassign-partitions.sh should support the KIP-455 API
> ---
>
> Key: KAFKA-8820
> URL: https://issues.apache.org/jira/browse/KAFKA-8820
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Gwen Shapira
>Assignee: Colin McCabe
>Priority: Major
>  Labels: kip-500
> Fix For: 2.5.0
>
>
> KIP-455 and KAFKA-8345 add a protocol and AdminAPI that will be used for 
> replica reassignments. We need to update the reassignment tool to use this 
> new API rather than work with ZK directly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-9569) RemoteStorageManager implementation for HDFS storage.

2023-09-18 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766427#comment-17766427
 ] 

Viktor Somogyi-Vass commented on KAFKA-9569:


[~Ying Zheng], [~satish.duggana] is this plugin available somewhere?

> RemoteStorageManager implementation for HDFS storage.
> -
>
> Key: KAFKA-9569
> URL: https://issues.apache.org/jira/browse/KAFKA-9569
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Satish Duggana
>Assignee: Ying Zheng
>Priority: Major
>
> This is about implementing `RemoteStorageManager` for HDFS to verify the 
> proposed SPIs are sufficient. It looks like the existing RSM interface should 
> be sufficient. If needed, we will discuss any required changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15219) Support delegation tokens in KRaft

2023-08-22 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757246#comment-17757246
 ] 

Viktor Somogyi-Vass commented on KAFKA-15219:
-

[~pprovenzano] thanks a lot for your work! Unfortunately I couldn't get to the 
review in the past few days but I had no more comments anyway.

> Support delegation tokens in KRaft
> --
>
> Key: KAFKA-15219
> URL: https://issues.apache.org/jira/browse/KAFKA-15219
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.6.0
>Reporter: Viktor Somogyi-Vass
>Assignee: Proven Provenzano
>Priority: Critical
> Fix For: 3.6.0
>
>
> Delegation tokens were introduced in KIP-48 and improved in KIP-373. KIP-900 
> paved the way to supporting them in KRaft by adding SCRAM support, but 
> delegation tokens are still not supported in KRaft.
> There are multiple issues:
> - TokenManager still would try to create tokens in Zookeeper. Instead of this 
> we should forward admin requests to the controller that would store them in 
> the metadata similarly to SCRAM. We probably won't need new protocols, just 
> enveloping similar to other existing controller requests.
> - TokenManager should run on Controller nodes only (or in mixed mode).
> - Integration tests will need to be adapted as well and parameterize them 
> with Zookeeper/KRaft.
> - Documentation needs to be improved to factor in KRaft.
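
For context, the admin request in question is the one issued through the client-side Admin API; a minimal sketch (the bootstrap server is a placeholder and SASL/authentication setup is omitted):

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.CreateDelegationTokenOptions;
import org.apache.kafka.common.security.token.delegation.DelegationToken;

public class DelegationTokenExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Under KRaft this request needs to be forwarded to the controller
            // instead of being written to ZooKeeper by the broker-side TokenManager.
            DelegationToken token = admin.createDelegationToken(new CreateDelegationTokenOptions())
                    .delegationToken()
                    .get();
            System.out.println("token id: " + token.tokenInfo().tokenId());
        }
    }
}
{code}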



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-8128) Dynamic JAAS change for clients

2023-08-16 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755108#comment-17755108
 ] 

Viktor Somogyi-Vass commented on KAFKA-8128:


I think for clients (since there is no REST API and such) we could make the 
client configs dynamic and/or detect a change in the JAAS file and reload it 
from the FS.
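
A minimal sketch of the second option, watching the JAAS file for modifications with java.nio's WatchService (the path is an example and the reload step is a placeholder):

{code:java}
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class JaasFileWatcher {
    public static void main(String[] args) throws Exception {
        Path jaasFile = Paths.get("/etc/kafka/client_jaas.conf"); // example path
        WatchService watcher = FileSystems.getDefault().newWatchService();
        jaasFile.getParent().register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take();
            for (WatchEvent<?> event : key.pollEvents()) {
                if (jaasFile.getFileName().equals(event.context())) {
                    // Placeholder: here the client would rebuild its JAAS login context / SASL credentials.
                    System.out.println("JAAS config changed, reloading login context...");
                }
            }
            key.reset();
        }
    }
}
{code}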

> Dynamic JAAS change for clients
> ---
>
> Key: KAFKA-8128
> URL: https://issues.apache.org/jira/browse/KAFKA-8128
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 2.2.0
>Reporter: Gabor Somogyi
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> Clients using JAAS based authentication are now forced to restart themselves 
> in order to reload the JAAS configuration. We could
> - make the {{sasl.jaas.config}} dynamically configurable and therefore better 
> equip them to changing tokens etc.
> - detect file system level changes in configured JAAS and reload the context.
> Original issue:
> Re-authentication feature on broker side is under implementation which will 
> enforce consumer/producer instances to re-authenticate time to time. It would 
> be good to set the latest delegation token dynamically and not re-creating 
> consumer/producer instances.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-8128) Dynamic JAAS change for clients

2023-08-16 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-8128:
---
Description: 
Clients using JAAS based authentication are now forced to restart themselves in 
order to reload the JAAS configuration. We could
- make the {{sasl.jaas.config}} dynamically configurable and therefore better 
equip them to changing tokens etc.
- detect file system level changes in configured JAAS and reload the context.


Original issue:
Re-authentication feature on broker side is under implementation which will 
enforce consumer/producer instances to re-authenticate time to time. It would 
be good to set the latest delegation token dynamically and not re-creating 
consumer/producer instances.

  was:Re-authentication feature on broker side is under implementation which 
will enforce consumer/producer instances to re-authenticate time to time. It 
would be good to set the latest delegation token dynamically and not 
re-creating consumer/producer instances.


> Dynamic JAAS change for clients
> ---
>
> Key: KAFKA-8128
> URL: https://issues.apache.org/jira/browse/KAFKA-8128
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 2.2.0
>Reporter: Gabor Somogyi
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> Clients using JAAS based authentication are now forced to restart themselves 
> in order to reload the JAAS configuration. We could
> - make the {{sasl.jaas.config}} dynamically configurable and therefore better 
> equip them to changing tokens etc.
> - detect file system level changes in configured JAAS and reload the context.
> Original issue:
> Re-authentication feature on broker side is under implementation which will 
> enforce consumer/producer instances to re-authenticate time to time. It would 
> be good to set the latest delegation token dynamically and not re-creating 
> consumer/producer instances.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-8128) Dynamic JAAS change for clients

2023-08-16 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-8128:
---
Summary: Dynamic JAAS change for clients  (was: Dynamic delegation token 
change possibility for consumer/producer)

> Dynamic JAAS change for clients
> ---
>
> Key: KAFKA-8128
> URL: https://issues.apache.org/jira/browse/KAFKA-8128
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 2.2.0
>Reporter: Gabor Somogyi
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> Re-authentication feature on broker side is under implementation which will 
> enforce consumer/producer instances to re-authenticate time to time. It would 
> be good to set the latest delegation token dynamically and not re-creating 
> consumer/producer instances.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15219) Support delegation tokens in KRaft

2023-07-26 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747550#comment-17747550
 ] 

Viktor Somogyi-Vass commented on KAFKA-15219:
-

No problem, thanks for letting me know it in time. I can take a look at your PR 
on Monday.

> Support delegation tokens in KRaft
> --
>
> Key: KAFKA-15219
> URL: https://issues.apache.org/jira/browse/KAFKA-15219
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.6.0
>Reporter: Viktor Somogyi-Vass
>Assignee: Proven Provenzano
>Priority: Critical
>
> Delegation tokens were introduced in KIP-48 and improved in KIP-373. KIP-900 
> paved the way to supporting them in KRaft by adding SCRAM support, but 
> delegation tokens are still not supported in KRaft.
> There are multiple issues:
> - TokenManager still would try to create tokens in Zookeeper. Instead of this 
> we should forward admin requests to the controller that would store them in 
> the metadata similarly to SCRAM. We probably won't need new protocols, just 
> enveloping similar to other existing controller requests.
> - TokenManager should run on Controller nodes only (or in mixed mode).
> - Integration tests will need to be adapted as well and parameterize them 
> with Zookeeper/KRaft.
> - Documentation needs to be improved to factor in KRaft.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15219) Support delegation tokens in KRaft

2023-07-19 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-15219:

Description: 
Delegation tokens were introduced in KIP-48 and improved in KIP-373. KIP-900 
paved the way to supporting them in KRaft by adding SCRAM support, but 
delegation tokens are still not supported in KRaft.

There are multiple issues:
- TokenManager still would try to create tokens in Zookeeper. Instead of this 
we should forward admin requests to the controller that would store them in the 
metadata similarly to SCRAM. We probably won't need new protocols, just 
enveloping similar to other existing controller requests.
- TokenManager should run on Controller nodes only (or in mixed mode).
- Integration tests will need to be adapted as well and parameterize them with 
Zookeeper/KRaft.
- Documentation needs to be improved to factor in KRaft.

  was:
Delegation tokens were introduced in KIP-48 and improved in KIP-373. KIP-900 
paved the way to supporting them in KRaft by adding SCRAM support, but 
delegation tokens are still not supported in KRaft.

There are multiple issues:
- TokenManager still would try to create tokens in Zookeeper. Instead of this 
we should forward admin requests to the controller that would store them in the 
metadata similarly to SCRAM.
- TokenManager should run on Controller nodes only (or in mixed mode).
- Integration tests will need to be adapted as well and parameterize them with 
Zookeeper/KRaft.
- Documentation needs to be improved to factor in KRaft.


> Support delegation tokens in KRaft
> --
>
> Key: KAFKA-15219
> URL: https://issues.apache.org/jira/browse/KAFKA-15219
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.6.0
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Critical
>
> Delegation tokens were introduced in KIP-48 and improved in KIP-373. KIP-900 
> paved the way to supporting them in KRaft by adding SCRAM support, but 
> delegation tokens are still not supported in KRaft.
> There are multiple issues:
> - TokenManager still would try to create tokens in Zookeeper. Instead of this 
> we should forward admin requests to the controller that would store them in 
> the metadata similarly to SCRAM. We probably won't need new protocols, just 
> enveloping similar to other existing controller requests.
> - TokenManager should run on Controller nodes only (or in mixed mode).
> - Integration tests will need to be adapted as well and parameterize them 
> with Zookeeper/KRaft.
> - Documentation needs to be improved to factor in KRaft.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15219) Support delegation tokens in KRaft

2023-07-19 Thread Viktor Somogyi-Vass (Jira)
Viktor Somogyi-Vass created KAFKA-15219:
---

 Summary: Support delegation tokens in KRaft
 Key: KAFKA-15219
 URL: https://issues.apache.org/jira/browse/KAFKA-15219
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 3.6.0
Reporter: Viktor Somogyi-Vass
Assignee: Viktor Somogyi-Vass


Delegation tokens were introduced in KIP-48 and improved in KIP-373. KIP-900 
paved the way to supporting them in KRaft by adding SCRAM support, but 
delegation tokens are still not supported in KRaft.

There are multiple issues:
- TokenManager still would try to create tokens in Zookeeper. Instead of this 
we should forward admin requests to the controller that would store them in the 
metadata similarly to SCRAM.
- TokenManager should run on Controller nodes only (or in mixed mode).
- Integration tests will need to be adapted as well and parameterize them with 
Zookeeper/KRaft.
- Documentation needs to be improved to factor in KRaft.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-8128) Dynamic delegation token change possibility for consumer/producer

2023-07-19 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744640#comment-17744640
 ] 

Viktor Somogyi-Vass commented on KAFKA-8128:


So after 4 years (yes) I think I'll pick this up. The problem came up more 
recently with OAuth, but I believe the core problem here is that JAAS contexts 
won't get reloaded. Making it a dynamic configuration would solve it, but that 
probably requires a KIP too. I'll see what's up.
[~gsomogyi] does this seem correct? Did you have problems because you had to 
change the JAAS config or were there any other problems that you experienced?

> Dynamic delegation token change possibility for consumer/producer
> -
>
> Key: KAFKA-8128
> URL: https://issues.apache.org/jira/browse/KAFKA-8128
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 2.2.0
>Reporter: Gabor Somogyi
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> Re-authentication feature on broker side is under implementation which will 
> enforce consumer/producer instances to re-authenticate time to time. It would 
> be good to set the latest delegation token dynamically and not re-creating 
> consumer/producer instances.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-8128) Dynamic delegation token change possibility for consumer/producer

2023-07-19 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-8128:
--

Assignee: Viktor Somogyi-Vass

> Dynamic delegation token change possibility for consumer/producer
> -
>
> Key: KAFKA-8128
> URL: https://issues.apache.org/jira/browse/KAFKA-8128
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 2.2.0
>Reporter: Gabor Somogyi
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> Re-authentication feature on broker side is under implementation which will 
> enforce consumer/producer instances to re-authenticate time to time. It would 
> be good to set the latest delegation token dynamically and not re-creating 
> consumer/producer instances.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15161) InvalidReplicationFactorException at connect startup

2023-07-17 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-15161:

Description: 
h2. Problem description

In our system test environment, Connect may fail to start up in certain cases 
due to a very specific timing issue between a Kafka cluster start/restart and a 
Connect start/restart. In these cases, if a consumer in Connect starts and asks 
for topic metadata while the broker doesn't have metadata yet, it fails with 
the following exception:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 
org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
at 
org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
at 
org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
Replication factor is below 1 or larger than the number of available brokers.
{noformat}

Due to this error the Connect node stops and has to be manually restarted (and 
of course it fails the test scenarios as well).

h2. Reproduction

In my test scenario I had:
- 1 broker
- 1 connect distributed node
- I also had a patch that I applied on the broker to make sure we don't have 
metadata

Steps to repro:
# start up a broker without the patch
# put a breakpoint here: 
https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
# start up a distributed connect node
# restart the kafka broker with the patch to make sure there is no metadata
# once the broker is started, release the debugger in connect

It should run into the error cited above and shut down.

This is not desirable: the Connect cluster should retry to ensure continuous 
operation, or the broker should handle this case differently, for instance by 
returning a RetriableException.

The earliest version I've tried this on is 2.8, but I think it affects earlier 
(and later) versions as well.
Also, it seems that some full metadata requests succeed during startup and only 
the partial metadata request fails, hence the repro first starts the broker 
with metadata and then restarts it without it (to simulate this case).
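
A minimal sketch of the retry idea on the Connect side (not the actual KafkaBasedLog/worker code; the backoff and attempt limits are made-up knobs):

{code:java}
import java.util.List;
import java.util.function.Supplier;

import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.PartitionInfo;

public final class StartupRetry {
    private static final long RETRY_BACKOFF_MS = 1000L;
    private static final int MAX_ATTEMPTS = 30;

    // Retries a metadata fetch (e.g. consumer.partitionsFor("connect-offsets")) instead of
    // letting a transient InvalidReplicationFactorException kill the herder thread.
    public static List<PartitionInfo> fetchWithRetry(Supplier<List<PartitionInfo>> fetch)
            throws InterruptedException {
        KafkaException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return fetch.get();
            } catch (KafkaException e) {
                last = e;
                Thread.sleep(RETRY_BACKOFF_MS);
            }
        }
        throw last;
    }
}
{code}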

  was:
h2. Problem description

In our system test environment, Connect may fail to start up in certain cases 
due to a very specific timing issue between a Kafka cluster start/restart and a 
Connect start/restart. In these cases, if a consumer in Connect starts and asks 
for topic metadata while the broker doesn't have metadata yet, it fails with 
the following exception:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 

[jira] [Updated] (KAFKA-15161) InvalidReplicationFactorException at connect startup

2023-07-17 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-15161:

Description: 
h2. Problem description

In our system test environment, Connect may fail to start up in certain cases 
due to a very specific timing issue between a Kafka cluster start/restart and a 
Connect start/restart. In these cases, if a consumer in Connect starts and asks 
for topic metadata while the broker doesn't have metadata yet, it fails with 
the following exception:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 
org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
at 
org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
at 
org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
Replication factor is below 1 or larger than the number of available brokers.
{noformat}

Due to this error the Connect node stops and has to be manually restarted (and 
of course it fails the test scenarios as well).

h2. Reproduction

In my test scenario I had:
- 1 broker
- 1 connect distributed node
- I also had a patch that I applied on the broker to make sure we don't have 
metadata

Steps to repro:
# start up a broker without the patch (applies to ZK mode)
# put a breakpoint here: 
https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
# start up a distributed connect node
# restart the kafka broker with the patch to make sure there is no metadata
# once the broker is started, release the debugger in connect

It should run into the error cited above and shut down.

This is not desirable: the Connect cluster should retry to ensure its continuous 
operation, or the broker should handle this case differently, for instance by 
returning a RetriableException.

The earliest I've tried this is 2.8 but I think this affects versions before 
that as well (and after).
Also it seems like some full metadata requests succeed during startup and it's 
only the partial metadata request that fails, hence the first start of the 
broker with metadata and then the restart without it (to simulate this case).

  was:
h2. Problem description

In our system test environment in certain cases due to a very specific timing 
issue Connect may fail to start up. the problem lies in the very specific 
timing of a Kafka cluster and connect start/restart. In these cases while the 
broker doesn't have metadata and a consumer in connect starts and asks for 
topic metadata, it returns the following exception and fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 

[jira] [Updated] (KAFKA-15161) InvalidReplicationFactorException at connect startup

2023-07-10 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-15161:

Description: 
h2. Problem description

In our system test environment, Connect may occasionally fail to start up due to 
a very specific timing of the Kafka cluster and Connect start/restart. In these 
cases, when the broker doesn't have metadata yet and a consumer in Connect asks 
for topic metadata, the broker returns the following error and Connect fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 
org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
at 
org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
at 
org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
Replication factor is below 1 or larger than the number of available brokers.
{noformat}

Due to this error the Connect node stops and has to be manually restarted (and 
of course it fails the test scenarios as well).

h2. Reproduction

In my test scenario I had:
- 1 broker
- 1 connect distributed node
- I also had a patch that I applied on the broker to make sure we don't have 
metadata

Steps to repro:
# start up a broker without the patch (this can be reproduced in both ZK and 
KRaft mode)
# put a breakpoint here: 
https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
# start up a distributed connect node
# restart the kafka broker with the patch to make sure there is no metadata
# once the broker is started, release the debugger in connect

It should run into the error cited above and shut down.

This is not desirable: the Connect cluster should retry to ensure its continuous 
operation, or the broker should handle this case differently, for instance by 
returning a RetriableException.

The earliest I've tried this is 2.8 but I think this affects versions before 
that as well (and after).
Also it seems like some full metadata requests succeed during startup and it's 
only the partial metadata request that fails, hence the first start of the 
broker with metadata and then the restart without it (to simulate this case).

  was:
h2. Problem description

In our system test environment in certain cases due to a very specific timing 
issue Connect may fail to start up. the problem lies in the very specific 
timing of a Kafka cluster and connect start/restart. In these cases while the 
broker doesn't have metadata and a consumer in connect starts and asks for 
topic metadata, it returns the following exception and fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 

[jira] [Updated] (KAFKA-15161) InvalidReplicationFactorException at connect startup

2023-07-07 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-15161:

Description: 
h2. Problem description

In our system test environment, Connect may occasionally fail to start up due to 
a very specific timing of the Kafka cluster and Connect start/restart. In these 
cases, when the broker doesn't have metadata yet and a consumer in Connect asks 
for topic metadata, the broker returns the following error and Connect fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 
org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
at 
org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
at 
org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
Replication factor is below 1 or larger than the number of available brokers.
{noformat}

Due to this error the Connect node stops and has to be manually restarted (and 
of course it fails the test scenarios as well).

h2. Reproduction

In my test scenario I had:
- 1 broker
- 1 connect distributed node
- I also had a patch that I applied on the broker to make sure we don't have 
metadata

Steps to repro:
# start up a zookeeper based broker without the patch
# put a breakpoint here: 
https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
# start up a distributed connect node
# restart the kafka broker with the patch to make sure there is no metadata
# once the broker is started, release the debugger in connect

It should run into the error cited above and shut down.

This is not desirable: the Connect cluster should retry to ensure its continuous 
operation, or the broker should handle this case differently, for instance by 
returning a RetriableException.

The earliest I've tried this is 2.8 but I think this affects versions before 
that as well (and after).
Also it seems like some full metadata requests succeed during startup and it's 
only the partial metadata request that fails, hence the first start of the 
broker with metadata and then the restart without it (to simulate this case).

  was:
h2. Problem description

In our system test environment in certain cases due to a very specific timing 
issue Connect may fail to start up. the problem lies in the very specific 
timing of a Kafka cluster and connect start/restart. In these cases while the 
broker doesn't have metadata and a consumer in connect starts and asks for 
topic metadata, it returns the following exception and fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 

[jira] [Updated] (KAFKA-15161) InvalidReplicationFactorException at connect startup

2023-07-07 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-15161:

Description: 
h2. Problem description

In our system test environment, Connect may occasionally fail to start up due to 
a very specific timing of the Kafka cluster and Connect start/restart. In these 
cases, when the broker doesn't have metadata yet and a consumer in Connect asks 
for topic metadata, the broker returns the following error and Connect fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 
org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
at 
org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
at 
org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
Replication factor is below 1 or larger than the number of available brokers.
{noformat}

Due to this error the Connect node stops and has to be manually restarted (and 
of course it fails the test scenarios as well).

h2. Reproduction

In my test scenario I had:
- 1 broker
- 1 connect distributed node
- I also had a patch that I applied on the broker to make sure we don't have 
metadata

Steps to repro:
# start up a zookeeper based broker without the patch
# put a breakpoint here: 
https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
# start up a distributed connect node
# restart the kafka broker with the patch to make sure there is no metadata
# once the broker is started, release the debugger in connect

It should run into the error cited above and shut down.

This is not desirable: the Connect cluster should retry to ensure its continuous 
operation, or the broker should handle this case differently, for instance by 
returning a RetriableException.

The earliest I've tried this is 2.8 but I think this affects versions before 
that as well (and after).
Also it seems like some full metadata requests succeed during startup and it's 
only the partial metadata request that fails, hence the first start of the 
broker with metadata and then the restart without it (to simulate this case).

Also, this happens only in ZooKeeper-based clusters, since KRaft defines a state 
machine for the brokers that prevents this error from occurring.

  was:
h2. Problem description

In our system test environment in certain cases due to a very specific timing 
issue Connect may fail to start up. the problem lies in the very specific 
timing of a Kafka cluster and connect start/restart. In these cases while the 
broker doesn't have metadata and a consumer in connect starts and asks for 
topic metadata, it returns the following exception and fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 

[jira] [Updated] (KAFKA-15161) InvalidReplicationFactorException at connect startup

2023-07-07 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-15161:

Description: 
h2. Problem description

In our system test environment, Connect may occasionally fail to start up due to 
a very specific timing of the Kafka cluster and Connect start/restart. In these 
cases, when the broker doesn't have metadata yet and a consumer in Connect asks 
for topic metadata, the broker returns the following error and Connect fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 
org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
at 
org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
at 
org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
Replication factor is below 1 or larger than the number of available brokers.
{noformat}

Due to this error the Connect node stops and has to be manually restarted (and 
of course it fails the test scenarios as well).

h2. Reproduction

In my test scenario I had:
- 1 broker
- 1 connect distributed node
- I also had a patch that I applied on the broker to make sure we don't have 
metadata

Steps to repro:
# start up a zookeeper based broker without the patch
# put a breakpoint here: 
https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
# start up a distributed connect node
# restart the kafka broker with the patch to make sure there is no metadata
# once the broker is started, release the debugger in connect

It should run into the error cited above and shut down.

This is not desirable: the Connect cluster should retry to ensure its continuous 
operation, or the broker should handle this case differently, for instance by 
returning a RetriableException.

The earliest I've tried this is 2.8 but I think this affects versions before 
that as well (and after).
Also it seems like some full metadata requests succeed during startup and it's 
only the partial metadata request that fails, hence the first start of the 
broker with metadata and then the restart without it (to simulate this case).

  was:
h2. Problem description

In our system test environment in certain cases due to a very specific timing 
issue Connect may fail to start up. the problem lies in the very specific 
timing of a Kafka cluster and connect start/restart. In these cases while the 
broker doesn't have metadata and a consumer in connect starts and asks for 
topic metadata, it returns the following exception and fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 

[jira] [Updated] (KAFKA-15161) InvalidReplicationFactorException at connect startup

2023-07-07 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-15161:

Attachment: empty_metadata.patch

> InvalidReplicationFactorException at connect startup
> 
>
> Key: KAFKA-15161
> URL: https://issues.apache.org/jira/browse/KAFKA-15161
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, KafkaConnect
>Affects Versions: 3.6.0
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Major
> Attachments: empty_metadata.patch
>
>
> h2. Problem description
> In our system test environment in certain cases due to a very specific timing 
> issue Connect may fail to start up. the problem lies in the very specific 
> timing of a Kafka cluster and connect start/restart. In these cases while the 
> broker doesn't have metadata and a consumer in connect starts and asks for 
> topic metadata, it returns the following exception and fails:
> {noformat}
> [2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
> groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> org.apache.kafka.common.KafkaException: Unexpected error fetching metadata 
> for topic connect-offsets
>   at 
> org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
>   at 
> org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
>   at 
> org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
>   at 
> org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
>   at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
>   at 
> org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
>   at 
> org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
> Replication factor is below 1 or larger than the number of available brokers.
> {noformat}
> Due to this error the connect node stops and it has to be manually restarted 
> (and ofc it fails the test scenarios as well).
> h2. Reproduction
> In my test scenario I had:
> - 1 broker
> - 1 connect distributed node
> - I also had a patch that I applied on the broker to make sure we don't have 
> metadata
> Steps to repro:
> # start up a zookeeper based broker without the patch
> # put a breakpoint here: 
> https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
> # start up a distributed connect node
> # restart the kafka broker with the patch to make sure there is no metadata
> # once the broker is started, release the debugger in connect
> It should run into the error cited above and shut down.
> This is not desirable, the connect cluster should retry to ensure its 
> continuous operation or the broker should handle this case somehow 
> differently, for instance by returning a RetriableException.
> The earliest I've tried this is 2.8 but I think this affects versions before 
> that as well (and after).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15161) InvalidReplicationFactorException at connect startup

2023-07-07 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-15161:

Issue Type: Bug  (was: Improvement)

> InvalidReplicationFactorException at connect startup
> 
>
> Key: KAFKA-15161
> URL: https://issues.apache.org/jira/browse/KAFKA-15161
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, KafkaConnect
>Affects Versions: 3.6.0
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> h2. Problem description
> In our system test environment in certain cases due to a very specific timing 
> issue Connect may fail to start up. the problem lies in the very specific 
> timing of a Kafka cluster and connect start/restart. In these cases while the 
> broker doesn't have metadata and a consumer in connect starts and asks for 
> topic metadata, it returns the following exception and fails:
> {noformat}
> [2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
> groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> org.apache.kafka.common.KafkaException: Unexpected error fetching metadata 
> for topic connect-offsets
>   at 
> org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
>   at 
> org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
>   at 
> org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
>   at 
> org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
>   at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
>   at 
> org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
>   at 
> org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
> Replication factor is below 1 or larger than the number of available brokers.
> {noformat}
> Due to this error the connect node stops and it has to be manually restarted 
> (and ofc it fails the test scenarios as well).
> h2. Reproduction
> In my test scenario I had:
> - 1 broker
> - 1 connect distributed node
> - I also had a patch that I applied on the broker to make sure we don't have 
> metadata
> Steps to repro:
> # start up a zookeeper based broker without the patch
> # put a breakpoint here: 
> https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
> # start up a distributed connect node
> # restart the kafka broker with the patch to make sure there is no metadata
> # once the broker is started, release the debugger in connect
> It should run into the error cited above and shut down.
> This is not desirable, the connect cluster should retry to ensure its 
> continuous operation or the broker should handle this case somehow 
> differently, for instance by returning a RetriableException.
> The earliest I've tried this is 2.8 but I think this affects versions before 
> that as well (and after).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15161) InvalidReplicationFactorException at connect startup

2023-07-07 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-15161:

Description: 
h2. Problem description

In our system test environment, Connect may occasionally fail to start up due to 
a very specific timing of the Kafka cluster and Connect start/restart. In these 
cases, when the broker doesn't have metadata yet and a consumer in Connect asks 
for topic metadata, the broker returns the following error and Connect fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 
org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
at 
org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
at 
org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
Replication factor is below 1 or larger than the number of available brokers.
{noformat}

Due to this error the Connect node stops and has to be manually restarted (and 
of course it fails the test scenarios as well).

h2. Reproduction

In my test scenario I had:
- 1 broker
- 1 connect distributed node
- I also had a patch that I applied on the broker to make sure we don't have 
metadata

Steps to repro:
# start up a zookeeper based broker without the patch
# put a breakpoint here: 
https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
# start up a distributed connect node
# restart the kafka broker with the patch to make sure there is no metadata
# once the broker is started, release the debugger in connect

It should run into the error cited above and shut down.

This is not desirable: the Connect cluster should retry to ensure its continuous 
operation, or the broker should handle this case differently, for instance by 
returning a RetriableException.

The earliest I've tried this is 2.8 but I think this affects versions before 
that as well (and after).

  was:
.h2 Problem description

In our system test environment in certain cases due to a very specific timing 
issue Connect may fail to start up. the problem lies in the very specific 
timing of a Kafka cluster and connect start/restart. In these cases while the 
broker doesn't have metadata and a consumer in connect starts and asks for 
topic metadata, it returns the following exception and fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 
org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
at 
org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
at 

[jira] [Assigned] (KAFKA-15161) InvalidReplicationFactorException at connect startup

2023-07-07 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-15161:
---

Assignee: Viktor Somogyi-Vass

> InvalidReplicationFactorException at connect startup
> 
>
> Key: KAFKA-15161
> URL: https://issues.apache.org/jira/browse/KAFKA-15161
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, KafkaConnect
>Affects Versions: 3.6.0
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> h2. Problem description
> In our system test environment in certain cases due to a very specific timing 
> issue Connect may fail to start up. the problem lies in the very specific 
> timing of a Kafka cluster and connect start/restart. In these cases while the 
> broker doesn't have metadata and a consumer in connect starts and asks for 
> topic metadata, it returns the following exception and fails:
> {noformat}
> [2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
> groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> org.apache.kafka.common.KafkaException: Unexpected error fetching metadata 
> for topic connect-offsets
>   at 
> org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
>   at 
> org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
>   at 
> org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
>   at 
> org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
>   at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
>   at 
> org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
>   at 
> org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
> Replication factor is below 1 or larger than the number of available brokers.
> {noformat}
> Due to this error the connect node stops and it has to be manually restarted 
> (and ofc it fails the test scenarios as well).
> h2. Reproduction
> In my test scenario I had:
> - 1 broker
> - 1 connect distributed node
> - I also had a patch that I applied on the broker to make sure we don't have 
> metadata
> Steps to repro:
> # start up a zookeeper based broker without the patch
> # put a breakpoint here: 
> https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
> # start up a distributed connect node
> # restart the kafka broker with the patch to make sure there is no metadata
> # once the broker is started, release the debugger in connect
> It should run into the error cited above and shut down.
> This is not desirable, the connect cluster should retry to ensure its 
> continuous operation or the broker should handle this case somehow 
> differently, for instance by returning a RetriableException.
> The earliest I've tried this is 2.8 but I think this affects versions before 
> that as well (and after).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15161) InvalidReplicationFactorException at connect startup

2023-07-07 Thread Viktor Somogyi-Vass (Jira)
Viktor Somogyi-Vass created KAFKA-15161:
---

 Summary: InvalidReplicationFactorException at connect startup
 Key: KAFKA-15161
 URL: https://issues.apache.org/jira/browse/KAFKA-15161
 Project: Kafka
  Issue Type: Improvement
  Components: clients, KafkaConnect
Affects Versions: 3.6.0
Reporter: Viktor Somogyi-Vass


h2. Problem description

In our system test environment, Connect may occasionally fail to start up due to 
a very specific timing of the Kafka cluster and Connect start/restart. In these 
cases, when the broker doesn't have metadata yet and a consumer in Connect asks 
for topic metadata, the broker returns the following error and Connect fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, 
groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  
(org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for 
topic connect-offsets
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
at 
org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
at 
org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
at 
org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
at 
org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: 
Replication factor is below 1 or larger than the number of available brokers.
{noformat}

Due to this error the Connect node stops and has to be manually restarted (and 
of course it fails the test scenarios as well).

h2. Reproduction

In my test scenario I had:
- 1 broker
- 1 connect distributed node
- I also had a patch that I applied on the broker to make sure we don't have 
metadata

Steps to repro:
# start up a zookeeper based broker without the patch
# put a breakpoint here: 
https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
# start up a distributed connect node
# restart the kafka broker with the patch to make sure there is no metadata
# once the broker is started, release the debugger in connect

It should run into the error cited above and shut down.

This is not desirable: the Connect cluster should retry to ensure its continuous 
operation, or the broker should handle this case differently, for instance by 
returning a RetriableException.

The earliest I've tried this is 2.8 but I think this affects versions before 
that as well (and after).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14112) Expose replication-offset-lag Mirror metric

2023-06-21 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735719#comment-17735719
 ] 

Viktor Somogyi-Vass commented on KAFKA-14112:
-

Also I think we would need a small KIP for new metrics, so please create that 
too.

> Expose replication-offset-lag Mirror metric
> ---
>
> Key: KAFKA-14112
> URL: https://issues.apache.org/jira/browse/KAFKA-14112
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Reporter: Elkhan Eminov
>Assignee: Elkhan Eminov
>Priority: Minor
>
> The offset lag is the difference of the last replicated record's (LRO) source 
> offset and the end offset of the source (LEO).
> The offset lag is a difference (LRO-LEO), but its constituents calculated at 
> different points of time and place
>  * LEO shall be calculated during source task's poll loop (ready to get it 
> from the consumer)
>  * LRO shall be kept in an in-memory "cache", that is updated during the 
> task's producer callback
> LRO is initialized when task is started, from the offset store. The 
> difference shall be calculated when the freshest LEO acquired
> in the poll loop. The calculated amount shall be defined as a MirrorMaker 
> metric.
> This would describe to amount of "to be replicated" number of records for a 
> certain topic-partition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14112) Expose replication-offset-lag Mirror metric

2023-06-21 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735718#comment-17735718
 ] 

Viktor Somogyi-Vass commented on KAFKA-14112:
-

Yes, it's LEO-LRO. By the way, it's not contributed back yet, so if you feel like 
it, I'm happy to review your PR if you implement it.

> Expose replication-offset-lag Mirror metric
> ---
>
> Key: KAFKA-14112
> URL: https://issues.apache.org/jira/browse/KAFKA-14112
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Reporter: Elkhan Eminov
>Assignee: Elkhan Eminov
>Priority: Minor
>
> The offset lag is the difference of the last replicated record's (LRO) source 
> offset and the end offset of the source (LEO).
> The offset lag is a difference (LRO-LEO), but its constituents calculated at 
> different points of time and place
>  * LEO shall be calculated during source task's poll loop (ready to get it 
> from the consumer)
>  * LRO shall be kept in an in-memory "cache", that is updated during the 
> task's producer callback
> LRO is initialized when task is started, from the offset store. The 
> difference shall be calculated when the freshest LEO acquired
> in the poll loop. The calculated amount shall be defined as a MirrorMaker 
> metric.
> This would describe to amount of "to be replicated" number of records for a 
> certain topic-partition.
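
To make the proposed metric concrete, below is a rough, illustrative sketch of 
the bookkeeping described in the quoted description, assuming the lag is 
computed as LEO minus LRO (as confirmed in the comment above) with LRO cached 
from the producer callback. The class and method names are invented; this is 
not MirrorMaker's actual implementation.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.common.TopicPartition;

// Illustrative per-partition bookkeeping: lag = log-end offset seen in the poll
// loop (LEO) minus the last replicated offset confirmed by the producer callback (LRO).
public class ReplicationLagTracker {

    private final Map<TopicPartition, Long> lastReplicatedOffsets = new ConcurrentHashMap<>();

    // Would be called once at task start with the offsets restored from the offset store.
    public void initialize(Map<TopicPartition, Long> restoredOffsets) {
        lastReplicatedOffsets.putAll(restoredOffsets);
    }

    // Called from the producer callback once a replicated record is acknowledged.
    public void recordReplicated(TopicPartition sourcePartition, long sourceOffset) {
        lastReplicatedOffsets.merge(sourcePartition, sourceOffset, Math::max);
    }

    // Called from the poll loop with the freshest end offset of the source partition;
    // the returned value would back the replication-offset-lag metric.
    public long lag(TopicPartition sourcePartition, long logEndOffset) {
        long lro = lastReplicatedOffsets.getOrDefault(sourcePartition, logEndOffset);
        return Math.max(0L, logEndOffset - lro);
    }
}
{code}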



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15059) Exactly-once source tasks fail to start during pending rebalances

2023-06-21 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass resolved KAFKA-15059.
-
Resolution: Fixed

[~ChrisEgerton] since the PR is merged, I'm resolving this ticket.

> Exactly-once source tasks fail to start during pending rebalances
> -
>
> Key: KAFKA-15059
> URL: https://issues.apache.org/jira/browse/KAFKA-15059
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect, mirrormaker
>Affects Versions: 3.6.0
>Reporter: Chris Egerton
>Assignee: Chris Egerton
>Priority: Blocker
> Fix For: 3.6.0
>
>
> When asked to perform a round of zombie fencing, the distributed herder will 
> [reject the 
> request|https://github.com/apache/kafka/blob/17fd30e6b457f097f6a524b516eca1a6a74a9144/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1249-L1250]
>  if a rebalance is pending, which can happen if (among other things) a config 
> for a new connector or a new set of task configs has been recently read from 
> the config topic.
> Normally this can be alleviated with a simple task restart, which isn't great 
> but isn't terrible.
> However, when running MirrorMaker 2 in dedicated mode, there is no API to 
> restart failed tasks, and it can be more common to see this kind of failure 
> on a fresh cluster because three connector configurations are written in 
> rapid succession to the config topic.
>  
> In order to provide a better experience for users of both vanilla Kafka 
> Connect and dedicated MirrorMaker 2 clusters, we can retry (likely with the 
> same exponential backoff introduced with KAFKA-14732) zombie fencing attempts 
> that fail due to a pending rebalance.
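
For illustration, a generic retry-with-exponential-backoff helper of the kind 
the quoted description refers to (the helper name and parameters are invented; 
this is not the actual Connect code):
{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class BackoffRetry {

    // Runs the attempt up to maxAttempts times, doubling the sleep between failures.
    public static <T> T retry(Callable<T> attempt, int maxAttempts,
                              long initialBackoffMs, long maxBackoffMs) throws Exception {
        long backoffMs = initialBackoffMs;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (Exception e) {
                if (i == maxAttempts) {
                    throw e;
                }
                // Add jitter so parallel workers don't retry in lock-step after a rebalance.
                long jitterMs = ThreadLocalRandom.current().nextLong(backoffMs / 2 + 1);
                Thread.sleep(Math.min(backoffMs + jitterMs, maxBackoffMs));
                backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
            }
        }
        throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
}
{code}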



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-12384) Flaky Test ListOffsetsRequestTest.testResponseIncludesLeaderEpoch

2023-05-24 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass resolved KAFKA-12384.
-
Fix Version/s: 3.6.0
   Resolution: Fixed

> Flaky Test ListOffsetsRequestTest.testResponseIncludesLeaderEpoch
> -
>
> Key: KAFKA-12384
> URL: https://issues.apache.org/jira/browse/KAFKA-12384
> Project: Kafka
>  Issue Type: Test
>  Components: core, unit tests
>Reporter: Matthias J. Sax
>Assignee: Chia-Ping Tsai
>Priority: Critical
>  Labels: flaky-test
> Fix For: 3.6.0, 3.0.0
>
>
> {quote}org.opentest4j.AssertionFailedError: expected: <(0,0)> but was: 
> <(-1,-1)> at 
> org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55) at 
> org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62) at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182) at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177) at 
> org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1124) at 
> kafka.server.ListOffsetsRequestTest.testResponseIncludesLeaderEpoch(ListOffsetsRequestTest.scala:172){quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-12384) Flaky Test ListOffsetsRequestTest.testResponseIncludesLeaderEpoch

2023-05-24 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725761#comment-17725761
 ] 

Viktor Somogyi-Vass commented on KAFKA-12384:
-

Ok, [~dajac] got to it sooner, resolving this.

> Flaky Test ListOffsetsRequestTest.testResponseIncludesLeaderEpoch
> -
>
> Key: KAFKA-12384
> URL: https://issues.apache.org/jira/browse/KAFKA-12384
> Project: Kafka
>  Issue Type: Test
>  Components: core, unit tests
>Reporter: Matthias J. Sax
>Assignee: Chia-Ping Tsai
>Priority: Critical
>  Labels: flaky-test
> Fix For: 3.0.0
>
>
> {quote}org.opentest4j.AssertionFailedError: expected: <(0,0)> but was: 
> <(-1,-1)> at 
> org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55) at 
> org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62) at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182) at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177) at 
> org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1124) at 
> kafka.server.ListOffsetsRequestTest.testResponseIncludesLeaderEpoch(ListOffsetsRequestTest.scala:172){quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-13504) Retry connect internal topics' creation in case of InvalidReplicationFactorException

2023-05-24 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-13504:

Description: 
When the Kafka broker cluster and the Kafka Connect cluster are started together 
and Connect tries to create its internal topics, there's a high chance that the 
creation fails with InvalidReplicationFactorException.
{noformat}
ERROR org.apache.kafka.connect.runtime.distributed.DistributedHerder [Worker 
clientId=connect-1, groupId=connect-cluster] Uncaught exception in herder work 
thread, exiting: 
org.apache.kafka.connect.errors.ConnectException: Error while attempting to 
create/find topic(s) 'connect-offsets'
...
Caused by: java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication 
factor: 3 larger than available brokers: 2.
{noformat}
Introducing retry logic here would make Connect a bit more robust.

The commit uses {{default.api.timeout.ms}} and {{retry.backoff.ms}} configs to 
control the retry mechanism.
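
A minimal sketch of what such a retry loop could look like, assuming the overall 
deadline comes from {{default.api.timeout.ms}} and the sleep between attempts 
from {{retry.backoff.ms}}; this is illustrative only, not the committed 
implementation.
{code:java}
import java.util.Collections;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.errors.InvalidReplicationFactorException;
import org.apache.kafka.common.errors.TopicExistsException;

// Illustrative retry loop around internal topic creation.
public class InternalTopicCreator {

    public static void createWithRetry(Admin admin, NewTopic topic,
                                       long timeoutMs, long backoffMs) throws Exception {
        long deadlineMs = System.currentTimeMillis() + timeoutMs;
        while (true) {
            try {
                admin.createTopics(Collections.singleton(topic)).all().get();
                return;
            } catch (ExecutionException e) {
                Throwable cause = e.getCause();
                if (cause instanceof TopicExistsException) {
                    return; // another worker created it in the meantime
                }
                // Not enough live brokers yet: back off and try again until the deadline.
                if (!(cause instanceof InvalidReplicationFactorException)
                        || System.currentTimeMillis() >= deadlineMs) {
                    throw e;
                }
                Thread.sleep(backoffMs);
            }
        }
    }
}
{code}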


  was:
In case the Kafka Broker cluster and the Kafka Connect cluster is started 
together and Connect would want to create its topics, there's a high chance to 
fail the creation with InvalidReplicationFactorException.
{noformat}
ERROR org.apache.kafka.connect.runtime.distributed.DistributedHerder [Worker 
clientId=connect-1, groupId=connect-cluster] Uncaught exception in herder work 
thread, exiting: 
org.apache.kafka.connect.errors.ConnectException: Error while attempting to 
create/find topic(s) 'connect-offsets'
...
Caused by: java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication 
factor: 3 larger than available brokers: 2.
{noformat}
Introducing a retry logic here would make Connect a bit more robust.

New configurations:
* offset.storage.topic.create.retries
* offset.storage.topic.create.retry.backoff.ms
* config.storage.topic.create.retries
* config.storage.topic.create.retry.backoff.ms
* status.storage.topic.create.retries
* status.storage.topic.create.retry.backoff.ms



> Retry connect internal topics' creation in case of 
> InvalidReplicationFactorException
> 
>
> Key: KAFKA-13504
> URL: https://issues.apache.org/jira/browse/KAFKA-13504
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Reporter: Andras Katona
>Assignee: Andras Katona
>Priority: Major
>  Labels: cloudera
>
> In case the Kafka Broker cluster and the Kafka Connect cluster is started 
> together and Connect would want to create its topics, there's a high chance 
> to fail the creation with InvalidReplicationFactorException.
> {noformat}
> ERROR org.apache.kafka.connect.runtime.distributed.DistributedHerder [Worker 
> clientId=connect-1, groupId=connect-cluster] Uncaught exception in herder 
> work thread, exiting: 
> org.apache.kafka.connect.errors.ConnectException: Error while attempting to 
> create/find topic(s) 'connect-offsets'
> ...
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication 
> factor: 3 larger than available brokers: 2.
> {noformat}
> Introducing a retry logic here would make Connect a bit more robust.
> The commit uses {{default.api.timeout.ms}} and {{retry.backoff.ms}} configs 
> to control the retry mechanism.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-12384) Flaky Test ListOffsetsRequestTest.testResponseIncludesLeaderEpoch

2023-05-24 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725703#comment-17725703
 ] 

Viktor Somogyi-Vass commented on KAFKA-12384:
-

[~satish.duggana] I checked out recent commits and ran these tests, and it seems 
like 
https://github.com/apache/kafka/commit/6f197301646135e0bb39a461ca0a07c09c3185fb 
introduces the flaky test. Would you please take a look at it if you have some 
time? Let me know if you need reviews, happy to do it.

> Flaky Test ListOffsetsRequestTest.testResponseIncludesLeaderEpoch
> -
>
> Key: KAFKA-12384
> URL: https://issues.apache.org/jira/browse/KAFKA-12384
> Project: Kafka
>  Issue Type: Test
>  Components: core, unit tests
>Reporter: Matthias J. Sax
>Assignee: Chia-Ping Tsai
>Priority: Critical
>  Labels: flaky-test
> Fix For: 3.0.0
>
>
> {quote}org.opentest4j.AssertionFailedError: expected: <(0,0)> but was: 
> <(-1,-1)> at 
> org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55) at 
> org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62) at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182) at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177) at 
> org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1124) at 
> kafka.server.ListOffsetsRequestTest.testResponseIncludesLeaderEpoch(ListOffsetsRequestTest.scala:172){quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-12384) Flaky Test ListOffsetsRequestTest.testResponseIncludesLeaderEpoch

2023-05-24 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725672#comment-17725672
 ] 

Viktor Somogyi-Vass commented on KAFKA-12384:
-

Found another one in 
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-11565/16/tests/

> Flaky Test ListOffsetsRequestTest.testResponseIncludesLeaderEpoch
> -
>
> Key: KAFKA-12384
> URL: https://issues.apache.org/jira/browse/KAFKA-12384
> Project: Kafka
>  Issue Type: Test
>  Components: core, unit tests
>Reporter: Matthias J. Sax
>Assignee: Chia-Ping Tsai
>Priority: Critical
>  Labels: flaky-test
> Fix For: 3.0.0
>
>
> {quote}org.opentest4j.AssertionFailedError: expected: <(0,0)> but was: 
> <(-1,-1)> at 
> org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55) at 
> org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62) at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182) at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177) at 
> org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1124) at 
> kafka.server.ListOffsetsRequestTest.testResponseIncludesLeaderEpoch(ListOffsetsRequestTest.scala:172){quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-13337) Scanning for Connect plugins can fail with AccessDeniedException

2023-05-19 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-13337:
---

Assignee: Andras Katona  (was: Tamás Héri)

> Scanning for Connect plugins can fail with AccessDeniedException
> 
>
> Key: KAFKA-13337
> URL: https://issues.apache.org/jira/browse/KAFKA-13337
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.7.1, 2.6.2, 3.1.0, 2.8.1
>Reporter: Tamás Héri
>Assignee: Andras Katona
>Priority: Minor
>
> During Connect plugin path scan, if an unreadable file/directory is found, 
> Connect will fail with an {{AccessDeniedException}}. As the directories/files 
> can be unreadable, it is best to skip them in this case. See referenced PR.
>  
> {noformat}
> java.nio.file.AccessDeniedException: 
> /tmp/junit8905851398112785578/plugins/.protected
>   at 
> java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
>   at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>   at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>   at 
> java.base/sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:432)
>   at java.base/java.nio.file.Files.newDirectoryStream(Files.java:604)
>   at 
> org.apache.kafka.connect.runtime.isolation.PluginUtils.pluginUrls(PluginUtils.java:276)
>   at 
> org.apache.kafka.connect.runtime.isolation.PluginUtilsTest.testPluginUrlsWithProtectedDirectory(PluginUtilsTest.java:481)
> ...
> {noformat}
> Connect server fails with the following exception (I created an "aaa" 
> directory only readable by root):
> {noformat}
> Could not get listing for plugin path: /var/lib/kafka. Ignoring.
> java.nio.file.AccessDeniedException: /var/lib/kafka/
>   at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at 
> sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:427)
>   at java.nio.file.Files.newDirectoryStream(Files.java:589)
>   at 
> org.apache.kafka.connect.runtime.isolation.PluginUtils.pluginUrls(PluginUtils.java:231)
>   at 
> org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.registerPlugin(DelegatingClassLoader.java:241)
>   at 
> org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initPluginLoader(DelegatingClassLoader.java:222)
>   at 
> org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initLoaders(DelegatingClassLoader.java:199)
>   at 
> org.apache.kafka.connect.runtime.isolation.Plugins.(Plugins.java:60)
>   at 
> org.apache.kafka.connect.cli.ConnectDistributed.startConnect(ConnectDistributed.java:91)
> {noformat}
> Additional note:
> The Connect server would not stop normally, but an extension couldn't be found 
> because of this in my case, which killed Connect at a later point.
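
A minimal sketch of the skip-on-AccessDeniedException idea using plain java.nio; this is illustrative only, not the actual PluginUtils change:
{code:java}
import java.io.IOException;
import java.nio.file.AccessDeniedException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PluginDirScanSketch {
    // Lists the children of a plugin path, skipping the path entirely if it is
    // unreadable instead of letting the AccessDeniedException abort the whole scan.
    public static List<Path> listReadable(Path pluginPath) throws IOException {
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(pluginPath)) {
            for (Path child : stream) {
                children.add(child);
            }
        } catch (AccessDeniedException e) {
            // unreadable directory: skip it instead of failing startup
        }
        return children;
    }
}
{code}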



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-13337) Scanning for Connect plugins can fail with AccessDeniedException

2023-05-19 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724260#comment-17724260
 ] 

Viktor Somogyi-Vass commented on KAFKA-13337:
-

[~akatona] took this task over in a new PR: 
https://github.com/apache/kafka/pull/13733

> Scanning for Connect plugins can fail with AccessDeniedException
> 
>
> Key: KAFKA-13337
> URL: https://issues.apache.org/jira/browse/KAFKA-13337
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.7.1, 2.6.2, 3.1.0, 2.8.1
>Reporter: Tamás Héri
>Assignee: Andras Katona
>Priority: Minor
>
> During Connect plugin path scan, if an unreadable file/directory is found, 
> Connect will fail with an {{AccessDeniedException}}. As the directories/files 
> can be unreadable, it is best to skip them in this case. See referenced PR.
>  
> {noformat}
> java.nio.file.AccessDeniedException: 
> /tmp/junit8905851398112785578/plugins/.protected
>   at 
> java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
>   at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>   at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>   at 
> java.base/sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:432)
>   at java.base/java.nio.file.Files.newDirectoryStream(Files.java:604)
>   at 
> org.apache.kafka.connect.runtime.isolation.PluginUtils.pluginUrls(PluginUtils.java:276)
>   at 
> org.apache.kafka.connect.runtime.isolation.PluginUtilsTest.testPluginUrlsWithProtectedDirectory(PluginUtilsTest.java:481)
> ...
> {noformat}
> Connect server fails with the following exception (I created an "aaa" 
> directory only readable by root):
> {noformat}
> Could not get listing for plugin path: /var/lib/kafka. Ignoring.
> java.nio.file.AccessDeniedException: /var/lib/kafka/
>   at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at 
> sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:427)
>   at java.nio.file.Files.newDirectoryStream(Files.java:589)
>   at 
> org.apache.kafka.connect.runtime.isolation.PluginUtils.pluginUrls(PluginUtils.java:231)
>   at 
> org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.registerPlugin(DelegatingClassLoader.java:241)
>   at 
> org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initPluginLoader(DelegatingClassLoader.java:222)
>   at 
> org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initLoaders(DelegatingClassLoader.java:199)
>   at 
> org.apache.kafka.connect.runtime.isolation.Plugins.(Plugins.java:60)
>   at 
> org.apache.kafka.connect.cli.ConnectDistributed.startConnect(ConnectDistributed.java:91)
> {noformat}
> Additional note:
> The Connect server would not stop normally, but an extension couldn't be found 
> because of this in my case, which killed Connect at a later point.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14978) ExactlyOnceWorkerSourceTask does not remove parent metrics

2023-05-19 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-14978:

Fix Version/s: 3.5.0
   3.3.3

> ExactlyOnceWorkerSourceTask does not remove parent metrics
> --
>
> Key: KAFKA-14978
> URL: https://issues.apache.org/jira/browse/KAFKA-14978
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Reporter: Daniel Urban
>Assignee: Daniel Urban
>Priority: Major
> Fix For: 3.5.0, 3.4.1, 3.3.3, 3.6.0
>
>
> ExactlyOnceWorkerSourceTask removeMetrics does not invoke 
> super.removeMetrics, meaning that only the transactional metrics are removed, 
> and common source task metrics are not.
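
The shape of the fix is simply to delegate to the parent as well; a schematic sketch, not the actual Connect classes:
{code:java}
// Illustrative class names only.
class ParentTaskMetricsGroup {
    void removeMetrics() {
        // removes the common source task metrics
    }
}

class ExactlyOnceTaskMetricsGroup extends ParentTaskMetricsGroup {
    @Override
    void removeMetrics() {
        // remove the transactional metrics here ...
        // ... then also remove the parent's common source task metrics
        super.removeMetrics();
    }
}
{code}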



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14929) Flaky KafkaStatusBackingStoreFormatTest#putTopicStateRetriableFailure

2023-04-27 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass resolved KAFKA-14929.
-
Resolution: Fixed

> Flaky KafkaStatusBackingStoreFormatTest#putTopicStateRetriableFailure
> -
>
> Key: KAFKA-14929
> URL: https://issues.apache.org/jira/browse/KAFKA-14929
> Project: Kafka
>  Issue Type: Test
>  Components: KafkaConnect
>Reporter: Greg Harris
>Assignee: Sagar Rao
>Priority: Major
>  Labels: flaky-test
> Fix For: 3.5.0
>
>
> This test recently started flaky-failing with the following stack trace:
> {noformat}
> org.mockito.exceptions.verification.TooFewActualInvocations: 
> kafkaBasedLog.send(, , );
> Wanted 2 times:->
>  at org.apache.kafka.connect.util.KafkaBasedLog.send(KafkaBasedLog.java:376)
> But was 1 time:->
>  at 
> org.apache.kafka.connect.storage.KafkaStatusBackingStore.sendTopicStatus(KafkaStatusBackingStore.java:315)
>   at 
> app//org.apache.kafka.connect.util.KafkaBasedLog.send(KafkaBasedLog.java:376)
>   at 
> app//org.apache.kafka.connect.storage.KafkaStatusBackingStoreFormatTest.putTopicStateRetriableFailure(KafkaStatusBackingStoreFormatTest.java:219)
>   at 
> java.base@11.0.16.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>  Method)
>   at 
> java.base@11.0.16.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base@11.0.16.1/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base@11.0.16.1/java.lang.reflect.Method.invoke(Method.java:566)
> ...{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-14929) Flaky KafkaStatusBackingStoreFormatTest#putTopicStateRetriableFailure

2023-04-22 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-14929:
---

Assignee: Viktor Somogyi-Vass

> Flaky KafkaStatusBackingStoreFormatTest#putTopicStateRetriableFailure
> -
>
> Key: KAFKA-14929
> URL: https://issues.apache.org/jira/browse/KAFKA-14929
> Project: Kafka
>  Issue Type: Test
>  Components: KafkaConnect
>Reporter: Greg Harris
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>  Labels: flaky-test
> Fix For: 3.5.0
>
>
> This test recently started flaky-failing with the following stack trace:
> {noformat}
> org.mockito.exceptions.verification.TooFewActualInvocations: 
> kafkaBasedLog.send(, , );
> Wanted 2 times:->
>  at org.apache.kafka.connect.util.KafkaBasedLog.send(KafkaBasedLog.java:376)
> But was 1 time:->
>  at 
> org.apache.kafka.connect.storage.KafkaStatusBackingStore.sendTopicStatus(KafkaStatusBackingStore.java:315)
>   at 
> app//org.apache.kafka.connect.util.KafkaBasedLog.send(KafkaBasedLog.java:376)
>   at 
> app//org.apache.kafka.connect.storage.KafkaStatusBackingStoreFormatTest.putTopicStateRetriableFailure(KafkaStatusBackingStoreFormatTest.java:219)
>   at 
> java.base@11.0.16.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>  Method)
>   at 
> java.base@11.0.16.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base@11.0.16.1/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base@11.0.16.1/java.lang.reflect.Method.invoke(Method.java:566)
> ...{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14112) Expose replication-offset-lag Mirror metric

2023-03-10 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698824#comment-17698824
 ] 

Viktor Somogyi-Vass commented on KAFKA-14112:
-

[~elkkhan] do you have a PR for this? Are you still planning to open one?

> Expose replication-offset-lag Mirror metric
> ---
>
> Key: KAFKA-14112
> URL: https://issues.apache.org/jira/browse/KAFKA-14112
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Reporter: Elkhan Eminov
>Assignee: Elkhan Eminov
>Priority: Minor
>
> The offset lag is the difference of the last replicated record's source 
> offset and the end offset of the source.
> The offset lag is a difference (LRO-LEO), but its constituents are calculated 
> at different points of time and place:
>  * LEO shall be calculated during the source task's poll loop (ready to get it 
> from the consumer)
>  * LRO shall be kept in an in-memory "cache" that is updated during the 
> task's producer callback
> LRO is initialized from the offset store when the task is started. The 
> difference shall be calculated when the freshest LEO is acquired in the poll 
> loop. The calculated amount shall be defined as a MirrorMaker metric.
> This would describe the number of "to be replicated" records for a certain 
> topic-partition.
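
A rough sketch of the calculation described above; the class and method names are made up for illustration and do not reflect the actual MirrorMaker code:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.common.TopicPartition;

public class ReplicationOffsetLagSketch {
    // LRO cache, updated from the producer callback with the source offset of the acked record
    private final Map<TopicPartition, Long> lastReplicatedOffsets = new ConcurrentHashMap<>();

    public void recordReplicated(TopicPartition tp, long sourceOffset) {
        lastReplicatedOffsets.put(tp, sourceOffset);
    }

    // Called from the poll loop with the freshest end offset taken from the consumer;
    // the exact off-by-one handling is illustrative only.
    public long offsetLag(TopicPartition tp, long logEndOffset) {
        Long lro = lastReplicatedOffsets.get(tp);
        if (lro == null) {
            return logEndOffset; // nothing replicated yet
        }
        return Math.max(0, logEndOffset - lro - 1);
    }
}
{code}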



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-14667) Delayed leader election operation gets stuck in purgatory

2023-02-02 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-14667:
---

Assignee: Viktor Somogyi-Vass

> Delayed leader election operation gets stuck in purgatory
> -
>
> Key: KAFKA-14667
> URL: https://issues.apache.org/jira/browse/KAFKA-14667
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: Daniel Urban
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> This was observed with Kafka 3.1.1, but I believe that latest versions are 
> also affected.
> In the Cruise Control project, there is an integration test: 
> com.linkedin.kafka.cruisecontrol.executor.ExecutorTest#testReplicaReassignmentProgressWithThrottle
> On our infrastructure, this test fails every ~20th run with a timeout - the 
> triggered preferred leadership election is never completed. After some 
> investigation, it turns out that:
>  # The admin client never gets a response from the broker.
>  # The leadership change is executed successfully.
>  # The ElectLeader purgatory never gets an update for the relevant topic 
> partition.
> A few relevant lines from a failed run (this test uses an embedded cluster, 
> logs are mixed):
> CC successfully sends a preferred election request to the controller (broker 
> 0), topic1-0 needs a leadership change from broker 0 to broker 1:
> {code:java}
> 2023-02-01 01:20:26.028 [controller-event-thread] DEBUG 
> kafka.controller.KafkaController - [Controller id=0] Waiting for any 
> successful result for election type (PREFERRED) by AdminClientTriggered for 
> partitions: Map(topic1-0 -> Right(1), topic0-0 -> 
> Left(ApiError(error=ELECTION_NOT_NEEDED, message=Leader election not needed 
> for topic partition.)))
> 2023-02-01 01:20:26.031 [controller-event-thread] DEBUG 
> kafka.server.DelayedElectLeader - tryComplete() waitingPartitions: 
> HashMap(topic1-0 -> 1) {code}
> The delayed operation for the leader election is triggered 2 times in quick 
> succession (yes, same ms in both logs):
> {code:java}
> 2023-02-01 01:20:26.031 [controller-event-thread] DEBUG 
> kafka.server.DelayedElectLeader - tryComplete() waitingPartitions: 
> HashMap(topic1-0 -> 1)
> 2023-02-01 01:20:26.031 [controller-event-thread] DEBUG 
> kafka.server.DelayedElectLeader - tryComplete() waitingPartitions: 
> HashMap(topic1-0 -> 1){code}
> Shortly after (few ms later based on the logs), broker 0 receives an 
> UpdateMetadataRequest from the controller (itself) and processes it:
> {code:java}
> 2023-02-01 01:20:26.033 [Controller-0-to-broker-0-send-thread] DEBUG 
> org.apache.kafka.clients.NetworkClient - [Controller id=0, targetBrokerId=0] 
> Sending UPDATE_METADATA request with header 
> RequestHeader(apiKey=UPDATE_METADATA, apiVersion=7, clientId=0, 
> correlationId=19) and timeout 3 to node 0: 
> UpdateMetadataRequestData(controllerId=0, controllerEpoch=1, brokerEpoch=25, 
> ungroupedPartitionStates=[], 
> topicStates=[UpdateMetadataTopicState(topicName='topic1', 
> topicId=gkFP8VnkSGyEf_LBBZSowQ, 
> partitionStates=[UpdateMetadataPartitionState(topicName='topic1', 
> partitionIndex=0, controllerEpoch=1, leader=1, leaderEpoch=2, isr=[0, 1], 
> zkVersion=2, replicas=[1, 0], offlineReplicas=[])])], 
> liveBrokers=[UpdateMetadataBroker(id=1, v0Host='', v0Port=0, 
> endpoints=[UpdateMetadataEndpoint(port=40236, host='localhost', 
> listener='PLAINTEXT', securityProtocol=0)], rack=null), 
> UpdateMetadataBroker(id=0, v0Host='', v0Port=0, 
> endpoints=[UpdateMetadataEndpoint(port=42556, host='localhost', 
> listener='PLAINTEXT', securityProtocol=0)], rack=null)])
> 2023-02-01 01:20:26.035 [Controller-0-to-broker-0-send-thread] DEBUG 
> org.apache.kafka.clients.NetworkClient - [Controller id=0, targetBrokerId=0] 
> Received UPDATE_METADATA response from node 0 for request with header 
> RequestHeader(apiKey=UPDATE_METADATA, apiVersion=7, clientId=0, 
> correlationId=19): UpdateMetadataResponseData(errorCode=0)
> 2023-02-01 01:20:26.035 
> [data-plane-kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-0] DEBUG 
> kafka.request.logger - Completed 
> 

[jira] [Updated] (KAFKA-14281) Multi-level rack awareness

2022-10-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-14281:

Labels: cloudera  (was: )

> Multi-level rack awareness
> --
>
> Key: KAFKA-14281
> URL: https://issues.apache.org/jira/browse/KAFKA-14281
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.4.0
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>  Labels: cloudera
>
> h1. Motivation
> With replication services, data can be replicated across independent Kafka 
> clusters in multiple data centers. In addition, many customers need "stretch 
> clusters" - a single Kafka cluster that spans across multiple data centers. 
> This architecture has the following useful characteristics:
>  - Data is natively replicated into all data centers by Kafka topic 
> replication.
>  - No data is lost when 1 DC is lost and no configuration change is required 
> - design is implicitly relying on native Kafka replication.
>  - From an operational point of view, it is much easier to configure and operate 
> such a topology than a replication scenario via MM2.
> Kafka should provide "native" support for stretch clusters, covering any 
> special aspects of operations of stretch cluster.
> h2. Multi-level rack awareness
> Additionally, stretch clusters are implemented using the rack awareness 
> feature, where each DC is represented as a rack. This ensures that replicas 
> are spread across DCs evenly. Unfortunately, there are cases where this is 
> too limiting - in case there are actual racks inside the DCs, we cannot 
> specify those. Consider having 3 DCs with 2 racks each:
> /DC1/R1, /DC1/R2
> /DC2/R1, /DC2/R2
> /DC3/R1, /DC3/R2
> If we were to use racks as DC1, DC2, DC3, we lose the rack-level information 
> of the setup. This means that it is possible that when we are using RF=6, 
> that the 2 replicas assigned to DC1 will both end up in the same rack.
> If we were to use racks as /DC1/R1, /DC1/R2, etc, then when using RF=3, it is 
> possible that 2 replicas end up in the same DC, e.g. /DC1/R1, /DC1/R2, 
> /DC2/R1.
> Because of this, Kafka should support "multi-level" racks, which means that 
> rack IDs should be able to describe some kind of a hierarchy. With this 
> feature, brokers should be able to:
>  # spread replicas evenly based on the top level of the hierarchy (i.e. 
> first, between DCs)
>  # then inside a top-level unit (DC), if there are multiple replicas, they 
> should be spread evenly among lower-level units (i.e. between racks, then 
> between physical hosts, and so on)
>  ## repeat for all levels



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14250) Exception during normal operation in MirrorSourceTask causes the task to fail instead of shutting down gracefully

2022-10-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-14250:

Labels: cloudera  (was: )

> Exception during normal operation in MirrorSourceTask causes the task to fail 
> instead of shutting down gracefully
> -
>
> Key: KAFKA-14250
> URL: https://issues.apache.org/jira/browse/KAFKA-14250
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect, mirrormaker
>Affects Versions: 3.3
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>  Labels: cloudera
> Attachments: mm2.log
>
>
> In MirrorSourceTask we are loading offsets for the topic partitions. At this 
> point, while we are fetching the partitions, it is possible for the offset 
> reader to be stopped by a parallel thread. Stopping the reader causes a 
> CancellationException to be thrown, due to KAFKA-9051.
> Currently this exception is not caught in MirrorSourceTask and so the 
> exception propagates up and causes the task to go into FAILED state. We only 
> need it to go to STOPPED state so that it would be restarted later.
> This can be achieved by catching the exception and stopping the task directly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14331) Upgrade to Scala 2.13.10

2022-10-25 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17623677#comment-17623677
 ] 

Viktor Somogyi-Vass commented on KAFKA-14331:
-

Thanks [~mimaison], I'll resolve this as a duplicate then.

> Upgrade to Scala 2.13.10
> 
>
> Key: KAFKA-14331
> URL: https://issues.apache.org/jira/browse/KAFKA-14331
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.4.0
>Reporter: Viktor Somogyi-Vass
>Priority: Major
>
> There are some CVEs in Scala 2.13.8, so we should upgrade to the latest.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14331) Upgrade to Scala 2.13.10

2022-10-25 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass resolved KAFKA-14331.
-
Resolution: Duplicate

> Upgrade to Scala 2.13.10
> 
>
> Key: KAFKA-14331
> URL: https://issues.apache.org/jira/browse/KAFKA-14331
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.4.0
>Reporter: Viktor Somogyi-Vass
>Priority: Major
>
> There are some CVEs in Scala 2.13.8, so we should upgrade to the latest.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14331) Upgrade to Scala 2.13.10

2022-10-24 Thread Viktor Somogyi-Vass (Jira)
Viktor Somogyi-Vass created KAFKA-14331:
---

 Summary: Upgrade to Scala 2.13.10
 Key: KAFKA-14331
 URL: https://issues.apache.org/jira/browse/KAFKA-14331
 Project: Kafka
  Issue Type: Improvement
  Components: core
Affects Versions: 3.4.0
Reporter: Viktor Somogyi-Vass


There are some CVEs in Scala 2.13.8, so we should upgrade to the latest.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14281) Multi-level rack awareness

2022-10-13 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17617017#comment-17617017
 ] 

Viktor Somogyi-Vass commented on KAFKA-14281:
-

Thanks [~ottomata] for the linked evaluation, I think it's useful material. 
We're currently preparing our KIP, I'll hopefully publish it in a few days. 
We'll create an algorithm that'd spread evenly across multiple levels so it'd 
handle even assignments between racks. If you use Cruise Control I think we'll 
publish an optimization goal there too if the planned KIP gets adopted.

> Multi-level rack awareness
> --
>
> Key: KAFKA-14281
> URL: https://issues.apache.org/jira/browse/KAFKA-14281
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.4.0
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> h1. Motivation
> With replication services, data can be replicated across independent Kafka 
> clusters in multiple data centers. In addition, many customers need "stretch 
> clusters" - a single Kafka cluster that spans across multiple data centers. 
> This architecture has the following useful characteristics:
>  - Data is natively replicated into all data centers by Kafka topic 
> replication.
>  - No data is lost when 1 DC is lost and no configuration change is required 
> - design is implicitly relying on native Kafka replication.
>  - From an operational point of view, it is much easier to configure and operate 
> such a topology than a replication scenario via MM2.
> Kafka should provide "native" support for stretch clusters, covering any 
> special aspects of operations of stretch cluster.
> h2. Multi-level rack awareness
> Additionally, stretch clusters are implemented using the rack awareness 
> feature, where each DC is represented as a rack. This ensures that replicas 
> are spread across DCs evenly. Unfortunately, there are cases where this is 
> too limiting - in case there are actual racks inside the DCs, we cannot 
> specify those. Consider having 3 DCs with 2 racks each:
> /DC1/R1, /DC1/R2
> /DC2/R1, /DC2/R2
> /DC3/R1, /DC3/R2
> If we were to use racks as DC1, DC2, DC3, we lose the rack-level information 
> of the setup. This means that it is possible that when we are using RF=6, 
> that the 2 replicas assigned to DC1 will both end up in the same rack.
> If we were to use racks as /DC1/R1, /DC1/R2, etc, then when using RF=3, it is 
> possible that 2 replicas end up in the same DC, e.g. /DC1/R1, /DC1/R2, 
> /DC2/R1.
> Because of this, Kafka should support "multi-level" racks, which means that 
> rack IDs should be able to describe some kind of a hierarchy. With this 
> feature, brokers should be able to:
>  # spread replicas evenly based on the top level of the hierarchy (i.e. 
> first, between DCs)
>  # then inside a top-level unit (DC), if there are multiple replicas, they 
> should be spread evenly among lower-level units (i.e. between racks, then 
> between physical hosts, and so on)
>  ## repeat for all levels



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14281) Multi-level rack awareness

2022-10-05 Thread Viktor Somogyi-Vass (Jira)
Viktor Somogyi-Vass created KAFKA-14281:
---

 Summary: Multi-level rack awareness
 Key: KAFKA-14281
 URL: https://issues.apache.org/jira/browse/KAFKA-14281
 Project: Kafka
  Issue Type: Improvement
  Components: core
Affects Versions: 3.4.0
Reporter: Viktor Somogyi-Vass
Assignee: Viktor Somogyi-Vass


h1. Motivation

With replication services, data can be replicated across independent Kafka 
clusters in multiple data centers. In addition, many customers need "stretch 
clusters" - a single Kafka cluster that spans across multiple data centers. 
This architecture has the following useful characteristics:
 - Data is natively replicated into all data centers by Kafka topic replication.
 - No data is lost when 1 DC is lost and no configuration change is required - 
design is implicitly relying on native Kafka replication.
 - From an operational point of view, it is much easier to configure and operate 
such a topology than a replication scenario via MM2.

Kafka should provide "native" support for stretch clusters, covering any 
special aspects of operations of stretch cluster.

h2. Multi-level rack awareness

Additionally, stretch clusters are implemented using the rack awareness 
feature, where each DC is represented as a rack. This ensures that replicas are 
spread across DCs evenly. Unfortunately, there are cases where this is too 
limiting - in case there are actual racks inside the DCs, we cannot specify 
those. Consider having 3 DCs with 2 racks each:

/DC1/R1, /DC1/R2
/DC2/R1, /DC2/R2
/DC3/R1, /DC3/R2

If we were to use racks as DC1, DC2, DC3, we lose the rack-level information of 
the setup. This means that it is possible that when we are using RF=6, that the 
2 replicas assigned to DC1 will both end up in the same rack.

If we were to use racks as /DC1/R1, /DC1/R2, etc, then when using RF=3, it is 
possible that 2 replicas end up in the same DC, e.g. /DC1/R1, /DC1/R2, /DC2/R1.

Because of this, Kafka should support "multi-level" racks, which means that 
rack IDs should be able to describe some kind of a hierarchy. With this 
feature, brokers should be able to:
 # spread replicas evenly based on the top level of the hierarchy (i.e. first, 
between DCs)
 # then inside a top-level unit (DC), if there are multiple replicas, they 
should be spread evenly among lower-level units (i.e. between racks, then 
between physical hosts, and so on)
 ## repeat for all levels
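
As a purely hypothetical illustration (whether the hierarchy would be carried in the existing {{broker.rack}} property or in a new configuration is a KIP-level decision), a broker in DC1, rack R1 could be configured as:
{noformat}
broker.id=101
broker.rack=/DC1/R1
{noformat}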



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14250) Exception during normal operation in MirrorSourceTask causes the task to fail instead of shutting down gracefully

2022-09-27 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-14250:

Attachment: mm2.log

> Exception during normal operation in MirrorSourceTask causes the task to fail 
> instead of shutting down gracefully
> -
>
> Key: KAFKA-14250
> URL: https://issues.apache.org/jira/browse/KAFKA-14250
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect, mirrormaker
>Affects Versions: 3.3
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Major
> Attachments: mm2.log
>
>
> In MirrorSourceTask we are loading offsets for the topic partitions. At this 
> point, while we are fetching the partitions, it is possible for the offset 
> reader to be stopped by a parallel thread. Stopping the reader causes a 
> CancellationException to be thrown, due to KAFKA-9051.
> Currently this exception is not caught in MirrorSourceTask and so the 
> exception propagates up and causes the task to go into FAILED state. We only 
> need it to go to STOPPED state so that it would be restarted later.
> This can be achieved by catching the exception and stopping the task directly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14250) Exception during normal operation in MirrorSourceTask causes the task to fail instead of shutting down gracefully

2022-09-21 Thread Viktor Somogyi-Vass (Jira)
Viktor Somogyi-Vass created KAFKA-14250:
---

 Summary: Exception during normal operation in MirrorSourceTask 
causes the task to fail instead of shutting down gracefully
 Key: KAFKA-14250
 URL: https://issues.apache.org/jira/browse/KAFKA-14250
 Project: Kafka
  Issue Type: Bug
  Components: KafkaConnect, mirrormaker
Affects Versions: 3.3
Reporter: Viktor Somogyi-Vass
Assignee: Viktor Somogyi-Vass


In MirrorSourceTask we are loading offsets for the topic partitions. At this 
point, while we are fetching the partitions, it is possible for the offset 
reader to be stopped by a parallel thread. Stopping the reader causes a 
CancellationException to be thrown, due to KAFKA-9051.

Currently this exception is not caught in MirrorSourceTask and so the exception 
propagates up and causes the task to go into FAILED state. We only need it to 
go to STOPPED state so that it would be restarted later.

This can be achieved by catching the exception and stopping the task directly.
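
A schematic example of the proposed handling; the reader interface and method names below are made up for illustration and are not the actual MirrorSourceTask code:
{code:java}
import java.util.concurrent.CancellationException;

public class OffsetLoadingSketch {
    interface OffsetReader {
        Long loadOffset(String topic, int partition);
    }

    // Returns null when the offset reader was stopped concurrently, so the caller can
    // stop the task gracefully instead of letting the task fail.
    static Long loadOffsetOrNull(OffsetReader reader, String topic, int partition) {
        try {
            return reader.loadOffset(topic, partition);
        } catch (CancellationException e) {
            // the offset reader was stopped by a parallel thread (see KAFKA-9051);
            // treat this as a signal to stop, not as a task failure
            return null;
        }
    }
}
{code}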



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-13917) Avoid calling lookupCoordinator() in tight loop

2022-07-21 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-13917:

Affects Version/s: 3.3.0
   3.2.1

> Avoid calling lookupCoordinator() in tight loop
> ---
>
> Key: KAFKA-13917
> URL: https://issues.apache.org/jira/browse/KAFKA-13917
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Affects Versions: 3.1.0, 3.1.1, 3.3.0, 3.1.2, 3.2.1
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently the heartbeat thread's lookupCoordinator() is called in a tight 
> loop if brokers crash and the consumer is left running. Besides that it 
> floods the logs on debug level, it increases CPU usage as well.
> The fix is easy, just need to put a backoff call after coordinator lookup.
> Reproduction:
> # Start a few brokers
> # Create a topic and produce to it
> # Start consuming
> # Stop all brokers
> At this point lookupCoordinator() will be called in a tight loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-6945) Add support to allow users to acquire delegation tokens for other users

2022-06-22 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-6945:
---
Affects Version/s: 3.3.0

> Add support to allow users to acquire delegation tokens for other users
> ---
>
> Key: KAFKA-6945
> URL: https://issues.apache.org/jira/browse/KAFKA-6945
> Project: Kafka
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Manikumar
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>  Labels: needs-kip
> Fix For: 3.3.0
>
>
> Currently, we only allow a user to create a delegation token for that same 
> user. 
> We should allow users to acquire delegation tokens for other users.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Reopened] (KAFKA-6945) Add support to allow users to acquire delegation tokens for other users

2022-06-22 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reopened KAFKA-6945:


> Add support to allow users to acquire delegation tokens for other users
> ---
>
> Key: KAFKA-6945
> URL: https://issues.apache.org/jira/browse/KAFKA-6945
> Project: Kafka
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Manikumar
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>  Labels: needs-kip
> Fix For: 3.3.0
>
>
> Currently, we only allow a user to create a delegation token for that same 
> user. 
> We should allow users to acquire delegation tokens for other users.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (KAFKA-13949) Connect /connectors endpoint should support querying the active topics and the task configs

2022-05-31 Thread Viktor Somogyi-Vass (Jira)
Viktor Somogyi-Vass created KAFKA-13949:
---

 Summary: Connect /connectors endpoint should support querying the 
active topics and the task configs
 Key: KAFKA-13949
 URL: https://issues.apache.org/jira/browse/KAFKA-13949
 Project: Kafka
  Issue Type: Improvement
  Components: KafkaConnect
Affects Versions: 3.2.0
Reporter: Viktor Somogyi-Vass
Assignee: Viktor Somogyi-Vass


The /connectors endpoint supports the "expand" query parameter, which acts as a 
set of queried categories, currently supporting info (config) and status 
(monitoring status).

The endpoint should also support adding the active topics of a connector, and 
adding the separate task configs, too.
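
For reference, the existing categories are queried like this (documented Connect REST behaviour); the proposal would add analogous categories for active topics and per-task configs, whose names are not decided in this ticket:
{noformat}
GET /connectors?expand=info&expand=status
{noformat}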



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (KAFKA-13917) Avoid calling lookupCoordinator() in tight loop

2022-05-19 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-13917:

Description: 
Currently the heartbeat thread's lookupCoordinator() is called in a tight loop 
if brokers crash and the consumer is left running. Besides that it floods the 
logs on debug level, it increases CPU usage as well.

The fix is easy, just need to put a backoff call after coordinator lookup.

Reproduction:
# Start a few brokers
# Create a topic and produce to it
# Start consuming
# Stop all brokers
At this point lookupCoordinator() will be called in a tight loop.


  was:
Currently the heartbeat thread's lookupCoordinator() is called in a tight loop 
if brokers crash and the consumer is left running. Besides that it floods the 
logs on debug level, it increases CPU usage as well.

The fix is easy, just need to put a backoff call after coordinator lookup.


> Avoid calling lookupCoordinator() in tight loop
> ---
>
> Key: KAFKA-13917
> URL: https://issues.apache.org/jira/browse/KAFKA-13917
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Affects Versions: 3.1.0, 3.1.1, 3.1.2
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> Currently the heartbeat thread's lookupCoordinator() is called in a tight 
> loop if brokers crash and the consumer is left running. Besides that it 
> floods the logs on debug level, it increases CPU usage as well.
> The fix is easy, just need to put a backoff call after coordinator lookup.
> Reproduction:
> # Start a few brokers
> # Create a topic and produce to it
> # Start consuming
> # Stop all brokers
> At this point lookupCoordinator() will be called in a tight loop.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (KAFKA-13917) Avoid calling lookupCoordinator() in tight loop

2022-05-19 Thread Viktor Somogyi-Vass (Jira)
Viktor Somogyi-Vass created KAFKA-13917:
---

 Summary: Avoid calling lookupCoordinator() in tight loop
 Key: KAFKA-13917
 URL: https://issues.apache.org/jira/browse/KAFKA-13917
 Project: Kafka
  Issue Type: Improvement
  Components: consumer
Affects Versions: 3.1.1, 3.1.0, 3.1.2
Reporter: Viktor Somogyi-Vass
Assignee: Viktor Somogyi-Vass


Currently the heartbeat thread's lookupCoordinator() is called in a tight loop 
if brokers crash and the consumer is left running. Besides that it floods the 
logs on debug level, it increases CPU usage as well.

The fix is easy, just need to put a backoff call after coordinator lookup.
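
A self-contained sketch of the intended behaviour; the methods below are placeholders, not the real heartbeat thread code:
{code:java}
public class CoordinatorLookupBackoffSketch {
    private final long retryBackoffMs;
    private volatile boolean running = true;

    public CoordinatorLookupBackoffSketch(long retryBackoffMs) {
        this.retryBackoffMs = retryBackoffMs;
    }

    private boolean coordinatorUnknown() { return true; }   // placeholder
    private void lookupCoordinator() { /* placeholder for the FindCoordinator request */ }

    public void run() throws InterruptedException {
        while (running && coordinatorUnknown()) {
            lookupCoordinator();
            if (coordinatorUnknown()) {
                // the essence of the fix: back off before retrying instead of spinning
                Thread.sleep(retryBackoffMs);
            }
        }
    }
}
{code}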



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (KAFKA-9118) LogDirFailureHandler shouldn't use Zookeeper

2022-05-17 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538335#comment-17538335
 ] 

Viktor Somogyi-Vass edited comment on KAFKA-9118 at 5/17/22 5:06 PM:
-

[~cmccabe] thanks for the comment, reassigned this jira to [~mumrah].


was (Author: viktorsomogyi):
[~cmccabe] thanks for the comment, reassigned it to [~mumrah].

> LogDirFailureHandler shouldn't use Zookeeper
> 
>
> Key: KAFKA-9118
> URL: https://issues.apache.org/jira/browse/KAFKA-9118
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Viktor Somogyi-Vass
>Assignee: David Arthur
>Priority: Major
>
> As described in 
> [KIP-112|https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD#KIP-112:HandlediskfailureforJBOD-Zookeeper]:
> {noformat}
> 2. A log directory stops working on a broker during runtime
> - The controller watches the path /log_dir_event_notification for new znode.
> - The broker detects offline log directories during runtime.
> - The broker takes actions as if it has received StopReplicaRequest for this 
> replica. More specifically, the replica is no longer considered leader and is 
> removed from any replica fetcher thread. (The clients will receive a 
> UnknownTopicOrPartitionException at this point)
> - The broker notifies the controller by creating a sequential znode under 
> path /log_dir_event_notification with data of the format {"version" : 1, 
> "broker" : brokerId, "event" : LogDirFailure}.
> - The controller reads the znode to get the brokerId and finds that the event 
> type is LogDirFailure.
> - The controller deletes the notification znode
> - The controller sends LeaderAndIsrRequest to that broker to query the state 
> of all topic partitions on the broker. The LeaderAndIsrResponse from this 
> broker will specify KafkaStorageException for those partitions that are on 
> the bad log directories.
> - The controller updates the information of offline replicas in memory and 
> trigger leader election as appropriate.
> - The controller removes offline replicas from ISR in the ZK and sends 
> LeaderAndIsrRequest with updated ISR to be used by partition leaders.
> - The controller propagates the information of offline replicas to brokers by 
> sending UpdateMetadataRequest.
> {noformat}
> Instead of the notification ZNode we should use a Kafka protocol that sends a 
> notification message to the controller with the offline partitions. The 
> controller then updates the information of offline replicas in memory and 
> triggers leader election, then removes the replicas from ISR in ZK and sends a 
> LAIR and an UpdateMetadataRequest.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (KAFKA-9118) LogDirFailureHandler shouldn't use Zookeeper

2022-05-17 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-9118:
--

Assignee: David Arthur  (was: Viktor Somogyi-Vass)

> LogDirFailureHandler shouldn't use Zookeeper
> 
>
> Key: KAFKA-9118
> URL: https://issues.apache.org/jira/browse/KAFKA-9118
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Viktor Somogyi-Vass
>Assignee: David Arthur
>Priority: Major
>
> As described in 
> [KIP-112|https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD#KIP-112:HandlediskfailureforJBOD-Zookeeper]:
> {noformat}
> 2. A log directory stops working on a broker during runtime
> - The controller watches the path /log_dir_event_notification for new znode.
> - The broker detects offline log directories during runtime.
> - The broker takes actions as if it has received StopReplicaRequest for this 
> replica. More specifically, the replica is no longer considered leader and is 
> removed from any replica fetcher thread. (The clients will receive a 
> UnknownTopicOrPartitionException at this point)
> - The broker notifies the controller by creating a sequential znode under 
> path /log_dir_event_notification with data of the format {"version" : 1, 
> "broker" : brokerId, "event" : LogDirFailure}.
> - The controller reads the znode to get the brokerId and finds that the event 
> type is LogDirFailure.
> - The controller deletes the notification znode
> - The controller sends LeaderAndIsrRequest to that broker to query the state 
> of all topic partitions on the broker. The LeaderAndIsrResponse from this 
> broker will specify KafkaStorageException for those partitions that are on 
> the bad log directories.
> - The controller updates the information of offline replicas in memory and 
> trigger leader election as appropriate.
> - The controller removes offline replicas from ISR in the ZK and sends 
> LeaderAndIsrRequest with updated ISR to be used by partition leaders.
> - The controller propagates the information of offline replicas to brokers by 
> sending UpdateMetadataRequest.
> {noformat}
> Instead of the notification ZNode we should use a Kafka protocol that sends a 
> notification message to the controller with the offline partitions. The 
> controller then updates the information of offline replicas in memory and 
> triggers leader election, then removes the replicas from ISR in ZK and sends a 
> LAIR and an UpdateMetadataRequest.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (KAFKA-9118) LogDirFailureHandler shouldn't use Zookeeper

2022-05-17 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538335#comment-17538335
 ] 

Viktor Somogyi-Vass commented on KAFKA-9118:


[~cmccabe] thanks for the comment, reassigned it to [~mumrah].

> LogDirFailureHandler shouldn't use Zookeeper
> 
>
> Key: KAFKA-9118
> URL: https://issues.apache.org/jira/browse/KAFKA-9118
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Viktor Somogyi-Vass
>Assignee: David Arthur
>Priority: Major
>
> As described in 
> [KIP-112|https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD#KIP-112:HandlediskfailureforJBOD-Zookeeper]:
> {noformat}
> 2. A log directory stops working on a broker during runtime
> - The controller watches the path /log_dir_event_notification for new znode.
> - The broker detects offline log directories during runtime.
> - The broker takes actions as if it has received StopReplicaRequest for this 
> replica. More specifically, the replica is no longer considered leader and is 
> removed from any replica fetcher thread. (The clients will receive a 
> UnknownTopicOrPartitionException at this point)
> - The broker notifies the controller by creating a sequential znode under 
> path /log_dir_event_notification with data of the format {"version" : 1, 
> "broker" : brokerId, "event" : LogDirFailure}.
> - The controller reads the znode to get the brokerId and finds that the event 
> type is LogDirFailure.
> - The controller deletes the notification znode
> - The controller sends LeaderAndIsrRequest to that broker to query the state 
> of all topic partitions on the broker. The LeaderAndIsrResponse from this 
> broker will specify KafkaStorageException for those partitions that are on 
> the bad log directories.
> - The controller updates the information of offline replicas in memory and 
> trigger leader election as appropriate.
> - The controller removes offline replicas from ISR in the ZK and sends 
> LeaderAndIsrRequest with updated ISR to be used by partition leaders.
> - The controller propagates the information of offline replicas to brokers by 
> sending UpdateMetadataRequest.
> {noformat}
> Instead of the notification ZNode we should use a Kafka protocol that sends a 
> notification message to the controller with the offline partitions. The 
> controller then updates the information of offline replicas in memory and 
> triggers leader election, then removes the replicas from ISR in ZK and sends a 
> LAIR and an UpdateMetadataRequest.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (KAFKA-13848) Clients remain connected after SASL re-authentication fails

2022-05-09 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533793#comment-17533793
 ] 

Viktor Somogyi-Vass commented on KAFKA-13848:
-

[~acsaki] please write an email to the d...@kafka.apache.org email list so they 
can add you as a contributor. After this you'll be able to assign the jira to 
yourself. You can raise a PR regardless though.
(more on contribution: https://kafka.apache.org/contributing)

> Clients remain connected after SASL re-authentication fails
> ---
>
> Key: KAFKA-13848
> URL: https://issues.apache.org/jira/browse/KAFKA-13848
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.1.0
> Environment: https://github.com/acsaki/kafka-sasl-reauth
>Reporter: Andras Csaki
>Assignee: Andras Csaki
>Priority: Minor
>  Labels: Authentication, OAuth2, SASL
>
> Clients remain connected and able to produce or consume despite an expired 
> OAUTHBEARER token.
> The problem can be reproduced using the 
> https://github.com/acsaki/kafka-sasl-reauth project by starting the embedded 
> OAuth2 server and Kafka, then running the long running consumer in 
> OAuthBearerTest and then killing the OAuth2 server thus making the client 
> unable to re-authenticate.
> Root cause seems to be 
> SaslServerAuthenticator#calcCompletionTimesAndReturnSessionLifetimeMs failing 
> to set ReauthInfo#sessionExpirationTimeNanos when tokens have already expired 
> (when session life time goes negative), in turn causing 
> KafkaChannel#serverAuthenticationSessionExpired returning false and finally 
> SocketServer not closing the channel.
> The issue is observed with OAUTHBEARER but seems to have a wider impact on 
> SASL re-authentication.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (KAFKA-10430) Hook support

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-10430:

Labels: cloudera  (was: )

> Hook support
> 
>
> Key: KAFKA-10430
> URL: https://issues.apache.org/jira/browse/KAFKA-10430
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dennis Jaheruddin
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>  Labels: cloudera
>
> Currently, big data storage systems (like HDFS and HBase) allow other tooling 
> to hook into them, for instance Atlas.
> As data movement tools become more open to Kafka as well, it makes sense to 
> shift significant amounts of storage to Kafka, for instance when one just 
> needs a buffer.
> However, this may be blocked due to governance constraints, as currently 
> producers and consumers would need to actively make an effort to log 
> governance (whereas something like HDFS can guarantee its capture).
> Hence I believe we should make it possible to hook into kafka as well so one 
> does not simply depend on the integrity of the producers and consumers.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (KAFKA-10430) Hook support

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-10430:

Priority: Critical  (was: Major)

> Hook support
> 
>
> Key: KAFKA-10430
> URL: https://issues.apache.org/jira/browse/KAFKA-10430
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dennis Jaheruddin
>Assignee: Viktor Somogyi-Vass
>Priority: Critical
>  Labels: cloudera
>
> Currently, big data storage systems (like HDFS and HBase) allow other tooling 
> to hook into them, for instance Atlas.
> As data movement tools become more open to Kafka as well, it makes sense to 
> shift significant amounts of storage to Kafka, for instance when one just 
> needs a buffer.
> However, this may be blocked due to governance constraints, as currently 
> producers and consumers would need to actively make an effort to log 
> governance (whereas something like HDFS can guarantee its capture).
> Hence I believe we should make it possible to hook into kafka as well so one 
> does not simply depend on the integrity of the producers and consumers.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (KAFKA-8195) Unstable Producer After Send Failure in Transaction

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-8195:
--

Assignee: (was: Viktor Somogyi-Vass)

> Unstable Producer After Send Failure in Transaction
> ---
>
> Key: KAFKA-8195
> URL: https://issues.apache.org/jira/browse/KAFKA-8195
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.0.1, 2.2.0, 2.3.0
>Reporter: Gary Russell
>Priority: Major
>
> This journey started with [this Stack Overflow question | 
> https://stackoverflow.com/questions/55510898].
> I easily reproduced his issue (see my comments on his question).
> My first step was to take Spring out of the picture and replicate the issue 
> with the native {{Producer}} apis. The following code shows the result; I 
> have attached logs and stack traces.
> There are four methods in the test; the first performs 2 transactions, 
> successfully, on the same {{Producer}} instance.
> The second aborts 2 transactions, successfully, on the same {{Producer}} 
> instance - application level failures after performing a send.
> There are two flavors of the problem:
> The third attempts to send 2 messages, on the same {{Producer}} that are too 
> large for the topic; the first aborts as expected; the second send hangs in 
> {{abortTransaction}} after getting a {{TimeoutException}} when attempting to 
> {{get}} the send metadata. See log {{hang.same.producer.log}} - it also 
> includes the stack trace showing the hang.
> The fourth test is similar to the third but it closes the producer after the 
> first failure; this time, we timeout in {{initTransactions()}}.
> Subsequent executions of this test get the timeout on the first attempt - 
> that {{transactional.id}} seems to be unusable. Removing the logs was one way 
> I found to get past the problem.
> Test code
> {code:java}
>   public ApplicationRunner runner(AdminClient client, 
> DefaultKafkaProducerFactory pf) {
>   return args -> {
>   try {
>   Map configs = new 
> HashMap<>(pf.getConfigurationProperties());
>   int committed = testGoodTx(client, configs);
>   System.out.println("Successes (same producer) 
> committed: " + committed);
>   int rolledBack = testAppFailureTx(client, 
> configs);
>   System.out.println("App failures (same 
> producer) rolled back: " + rolledBack);
>   
>   // first flavor - hung thread in 
> abortTransaction()
> //rolledBack = 
> testSendFailureTxSameProducer(client, configs);
> //System.out.println("Send failures (same 
> producer) rolled back: " + rolledBack);
>   
>   // second flavor - timeout in initTransactions()
>   rolledBack = 
> testSendFailureTxNewProducer(client, configs);
>   System.out.println("Send failures (new 
> producer) rolled back: " + rolledBack);
>   }
>   catch (Exception e) {
>   e.printStackTrace();
>   }
>   };
>   }
>   private int testGoodTx(AdminClient client, Map configs)
>   throws ExecutionException {
>   int commits = 0;
>   NewTopic topic = TopicBuilder.name("so55510898a")
>   .partitions(1)
>   .replicas(1)
>   .build();
>   createTopic(client, topic);
>   configs.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "txa-");
>   Producer producer = new 
> KafkaProducer<>(configs);
>   try {
>   producer.initTransactions();
>   for (int i = 0; i < 2; i++) {
>   producer.beginTransaction();
>   RecordMetadata recordMetadata = producer.send(
>   new 
> ProducerRecord<>("so55510898a", "foo")).get(10, 
> TimeUnit.SECONDS);
>   System.out.println(recordMetadata);
>   producer.commitTransaction();
>   commits++;
>   }
>   }
>   catch (ProducerFencedException | OutOfOrderSequenceException | 
> AuthorizationException e) {
>   // We can't recover from these exceptions, so our only 
> option is to close the producer and exit.
>   }
>   catch 

[jira] [Updated] (KAFKA-10650) Use Murmur3 hashing instead of MD5 in SkimpyOffsetMap

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-10650:

Labels: cloudera  (was: )

> Use Murmur3 hashing instead of MD5 in SkimpyOffsetMap
> -
>
> Key: KAFKA-10650
> URL: https://issues.apache.org/jira/browse/KAFKA-10650
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>  Labels: cloudera
> Attachments: benchmark-evidence.png, benchmark-run-output
>
>
> The usage of MD5 has been uncovered during testing Kafka for FIPS (Federal 
> Information Processing Standards) verification.
> While MD5 isn't a FIPS incompatibility here, as it isn't used for 
> cryptographic purposes, I spent some time on this as it isn't ideal either. 
> MD5 is a relatively fast cryptographic hash, but there are much better 
> performing algorithms for the hash-table use it gets in SkimpyOffsetMap.
> By applying Murmur3 (that is implemented in Streams) I could achieve a 3x 
> faster {{put}} operation and the overall segment cleaning sped up by 30% 
> while preserving the same collision rate (both performed within 0.0015 - 
> 0.007, mostly with 0.004 median).
> Murmur3 was chosen because the research paper [1] shows that Murmur2 is a 
> relatively good choice for hash tables; since Murmur3 is already available in 
> the project, I used that. 
> [1]
> https://www.researchgate.net/publication/235663569_Performance_of_the_most_common_non-cryptographic_hash_functions
> Benchmark evidence (the smaller the better as this is average time):
>  !benchmark-evidence.png! 
> The benchmark can be reproduced by running {{./jmh.sh LogCleanerBenchmark}} 
> from 
> https://github.com/viktorsomogyi/kafka/tree/KAFKA-3987-hash-algorithm-murmur 
> in the {{jmh-benchmark}} folder.
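> 
> Illustration (not the actual SkimpyOffsetMap code; the class and method names 
> below are made up): the proposed change only swaps the hash function used to 
> place a key into the map, the slot-selection logic stays the same. A minimal, 
> self-contained sketch:
> {code:java}
> import java.nio.ByteBuffer;
> import java.security.MessageDigest;
> import java.security.NoSuchAlgorithmException;
> import java.util.function.ToIntFunction;
> 
> public class OffsetMapHashSketch {
> 
>     // Baseline: hash the key with MD5 and fold the digest into an int.
>     static int md5Hash(byte[] key) {
>         try {
>             byte[] digest = MessageDigest.getInstance("MD5").digest(key);
>             return ByteBuffer.wrap(digest).getInt();
>         } catch (NoSuchAlgorithmException e) {
>             throw new IllegalStateException(e);
>         }
>     }
> 
>     // Map a hash value to a slot index in a table with the given capacity.
>     static int slotFor(byte[] key, int slots, ToIntFunction<byte[]> hashFn) {
>         return (hashFn.applyAsInt(key) & 0x7fffffff) % slots;
>     }
> 
>     public static void main(String[] args) {
>         byte[] key = "some-log-key".getBytes();
>         // Today's behaviour: MD5-based placement.
>         System.out.println(slotFor(key, 1 << 20, OffsetMapHashSketch::md5Hash));
>         // The proposal swaps hashFn for a Murmur3 implementation (e.g. the one
>         // already shipped with Streams); everything else is unchanged.
>     }
> }
> {code}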



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (KAFKA-13504) Retry connect internal topics' creation in case of InvalidReplicationFactorException

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-13504:

Labels: cloudera  (was: )

> Retry connect internal topics' creation in case of 
> InvalidReplicationFactorException
> 
>
> Key: KAFKA-13504
> URL: https://issues.apache.org/jira/browse/KAFKA-13504
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Reporter: Andras Katona
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>  Labels: cloudera
>
> If the Kafka broker cluster and the Kafka Connect cluster are started 
> together and Connect tries to create its internal topics, there is a high 
> chance that the creation fails with InvalidReplicationFactorException.
> {noformat}
> ERROR org.apache.kafka.connect.runtime.distributed.DistributedHerder [Worker 
> clientId=connect-1, groupId=connect-cluster] Uncaught exception in herder 
> work thread, exiting: 
> org.apache.kafka.connect.errors.ConnectException: Error while attempting to 
> create/find topic(s) 'connect-offsets'
> ...
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication 
> factor: 3 larger than available brokers: 2.
> {noformat}
> Introducing a retry logic here would make Connect a bit more robust.
> New configurations:
> * offset.storage.topic.create.retries
> * offset.storage.topic.create.retry.backoff.ms
> * config.storage.topic.create.retries
> * config.storage.topic.create.retry.backoff.ms
> * status.storage.topic.create.retries
> * status.storage.topic.create.retry.backoff.ms
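> 
> A minimal sketch of the retry loop being proposed (the configuration names are 
> the ones listed above; the class and method names below are illustrative, not 
> the Connect implementation):
> {code:java}
> import java.util.Collections;
> import java.util.concurrent.ExecutionException;
> 
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.NewTopic;
> import org.apache.kafka.common.errors.InvalidReplicationFactorException;
> 
> public class TopicCreationRetrySketch {
>     // Retry topic creation while the cluster does not yet have enough brokers.
>     static void createWithRetries(Admin admin, NewTopic topic,
>                                   int retries, long backoffMs) throws Exception {
>         for (int attempt = 0; ; attempt++) {
>             try {
>                 admin.createTopics(Collections.singleton(topic)).all().get();
>                 return; // created ("already exists" handling omitted for brevity)
>             } catch (ExecutionException e) {
>                 boolean retriable = e.getCause() instanceof InvalidReplicationFactorException;
>                 if (retriable && attempt < retries) {
>                     Thread.sleep(backoffMs); // brokers still starting up; back off and retry
>                 } else {
>                     throw e;
>                 }
>             }
>         }
>     }
> }
> {code}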



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (KAFKA-10586) Full support for distributed mode in dedicated MirrorMaker 2.0 clusters

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-10586:

Labels: cloudera  (was: )

> Full support for distributed mode in dedicated MirrorMaker 2.0 clusters
> ---
>
> Key: KAFKA-10586
> URL: https://issues.apache.org/jira/browse/KAFKA-10586
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Reporter: Daniel Urban
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>  Labels: cloudera
>
> KIP-382 introduced MirrorMaker 2.0. Because of scoping issues, the dedicated 
> MirrorMaker 2.0 cluster does not utilize the Connect REST API. This means 
> that with specific workloads, the dedicated MM2 cluster can become unable to 
> react to dynamic topic and group filter changes.
> (This occurs when, after a rebalance operation, the leader node has no 
> MirrorSourceConnectorTasks. Because of this, the MirrorSourceConnector is 
> stopped on the leader, meaning it cannot detect config changes by itself. 
> Followers still running the connector can detect config changes, but they 
> cannot query the leader for config updates.)
> Besides the REST support, config provider references should be evaluated 
> lazily in connector configurations.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (KAFKA-6187) Remove the Logging class in favor of LazyLogging in scala-logging

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-6187:
---
Priority: Minor  (was: Major)

> Remove the Logging class in favor of LazyLogging in scala-logging
> -
>
> Key: KAFKA-6187
> URL: https://issues.apache.org/jira/browse/KAFKA-6187
> Project: Kafka
>  Issue Type: Task
>  Components: core
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Minor
>
> In KAFKA-1044 we removed the hard dependency on junit and enabled users to 
> exclude it in their environment without causing any problems. We also agreed 
> to remove the kafka.utils.Logging class as it can be made redundant by 
> LazyLogging in scala-logging.
> In this JIRA we will get rid of Logging by refactoring its remaining 
> functionalities (swallow* methods and logIdent).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (KAFKA-1774) REPL and Shell Client for Admin Message RQ/RP

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-1774:
---
Priority: Minor  (was: Major)

> REPL and Shell Client for Admin Message RQ/RP
> -
>
> Key: KAFKA-1774
> URL: https://issues.apache.org/jira/browse/KAFKA-1774
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Joe Stein
>Assignee: Viktor Somogyi-Vass
>Priority: Minor
>
> We should have a REPL we can work in and execute the commands with the 
> arguments. With this we can do:
> ./kafka.sh --shell 
> kafka>attach cluster -b localhost:9092;
> kafka>describe topic sampleTopicNameForExample;
> the command line version can work like it does now so folks don't have to 
> re-write all of their tooling.
> kafka.sh --topics --everything the same like kafka-topics.sh is 
> kafka.sh --reassign --everything the same like kafka-reassign-partitions.sh 
> is 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (KAFKA-8468) AdminClient.deleteTopics doesn't wait until topic is deleted

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-8468:
---
Labels: cloudera  (was: )

> AdminClient.deleteTopics doesn't wait until topic is deleted
> 
>
> Key: KAFKA-8468
> URL: https://issues.apache.org/jira/browse/KAFKA-8468
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.2.0, 2.3.0, 2.2.1, 2.4.0
>Reporter: Gabor Somogyi
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>  Labels: cloudera
>
> Please see the example app to reproduce the issue: 
> https://github.com/gaborgsomogyi/kafka-topic-stress
> ZKUtils is deprecated from Kafka version 2.0.0 but there is no real 
> alternative.
> * deleteTopics doesn't wait
> * ZookeeperClient has "private [kafka]" visibility
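> 
> A minimal sketch of one way a caller can wait for the deletion themselves, as 
> described above (names are illustrative): the future returned by deleteTopics 
> completes before the topic is actually gone, so the caller polls the topic 
> listing:
> {code:java}
> import java.util.Collections;
> 
> import org.apache.kafka.clients.admin.Admin;
> 
> public class DeleteTopicAndWaitSketch {
>     static void deleteAndWait(Admin admin, String topic, long timeoutMs) throws Exception {
>         admin.deleteTopics(Collections.singleton(topic)).all().get();
>         long deadline = System.currentTimeMillis() + timeoutMs;
>         // The future completing does not guarantee the topic is gone yet.
>         while (admin.listTopics().names().get().contains(topic)) {
>             if (System.currentTimeMillis() > deadline) {
>                 throw new IllegalStateException("Topic " + topic + " still present after delete");
>             }
>             Thread.sleep(200);
>         }
>     }
> }
> {code}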



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (KAFKA-13240) HTTP TRACE should be disabled in Connect

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass updated KAFKA-13240:

Labels: cloudera  (was: )

> HTTP TRACE should be disabled in Connect
> 
>
> Key: KAFKA-13240
> URL: https://issues.apache.org/jira/browse/KAFKA-13240
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Minor
>  Labels: cloudera
>
> Modern browsers mostly disable HTTP TRACE to prevent XST (cross-site 
> tracing) attacks. Because of this, this type of attack isn't very prevalent 
> these days, but since TRACE isn't disabled in Connect, it may still open up 
> possible attack vectors (and it constantly pops up in security scans :) ). 
> Therefore we'd like to disable it.
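> 
> One common Jetty-level way to do this (a sketch only, not necessarily how the 
> Connect fix is implemented): register a security constraint that rejects the 
> TRACE method on every path before starting the REST server:
> {code:java}
> import org.eclipse.jetty.security.ConstraintMapping;
> import org.eclipse.jetty.security.ConstraintSecurityHandler;
> import org.eclipse.jetty.servlet.ServletContextHandler;
> import org.eclipse.jetty.util.security.Constraint;
> 
> public class DisableTraceSketch {
>     static void disableTrace(ServletContextHandler context) {
>         Constraint disableTrace = new Constraint();
>         disableTrace.setName("Disable TRACE");
>         // authenticate=true with no roles defined means every TRACE request is rejected
>         disableTrace.setAuthenticate(true);
> 
>         ConstraintMapping mapping = new ConstraintMapping();
>         mapping.setConstraint(disableTrace);
>         mapping.setMethod("TRACE");
>         mapping.setPathSpec("/");
> 
>         ConstraintSecurityHandler securityHandler = new ConstraintSecurityHandler();
>         securityHandler.addConstraintMapping(mapping);
>         context.setSecurityHandler(securityHandler);
>     }
> }
> {code}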



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (KAFKA-6084) ReassignPartitionsCommand should propagate JSON parsing failures

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass resolved KAFKA-6084.

Fix Version/s: 2.8.0
   Resolution: Fixed

> ReassignPartitionsCommand should propagate JSON parsing failures
> 
>
> Key: KAFKA-6084
> URL: https://issues.apache.org/jira/browse/KAFKA-6084
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 0.11.0.0
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Minor
>  Labels: easyfix, newbie
> Fix For: 2.8.0
>
> Attachments: Screen Shot 2017-10-18 at 23.31.22.png
>
>
> Basically looking at Json.scala it will always swallow any parsing errors:
> {code}
>   def parseFull(input: String): Option[JsonValue] =
> try Option(mapper.readTree(input)).map(JsonValue(_))
> catch { case _: JsonProcessingException => None }
> {code}
> However, while sometimes it is easy to figure out the problem by simply 
> looking at the JSON, in other cases it is not trivial: some invisible 
> characters (like a byte order mark) won't be displayed by most text editors, 
> and people can spend a lot of time figuring out what the problem is.
> As Jackson provides a really detailed exception about what failed and how, it 
> is easy to propagate the failure to the user.
> As an example I attached a BOM prefixed JSON which fails with the following 
> error which is very counterintuitive:
> {noformat}
> [root@localhost ~]# kafka-reassign-partitions --zookeeper localhost:2181 
> --reassignment-json-file /root/increase-replication-factor.json --execute
> Partitions reassignment failed due to Partition reassignment data file 
> /root/increase-replication-factor.json is empty
> kafka.common.AdminCommandFailedException: Partition reassignment data file 
> /root/increase-replication-factor.json is empty
> at 
> kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:120)
> at 
> kafka.admin.ReassignPartitionsCommand$.main(ReassignPartitionsCommand.scala:52)
> at kafka.admin.ReassignPartitionsCommand.main(ReassignPartitionsCommand.scala)
> ...
> {noformat}
> In case of the above error it would be much better to see what fails exactly:
> {noformat}
> kafka.common.AdminCommandFailedException: Admin command failed
>   at 
> kafka.admin.ReassignPartitionsCommand$.parsePartitionReassignmentData(ReassignPartitionsCommand.scala:267)
>   at 
> kafka.admin.ReassignPartitionsCommand$.parseAndValidate(ReassignPartitionsCommand.scala:275)
>   at 
> kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:197)
>   at 
> kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:193)
>   at 
> kafka.admin.ReassignPartitionsCommand$.main(ReassignPartitionsCommand.scala:64)
>   at 
> kafka.admin.ReassignPartitionsCommand.main(ReassignPartitionsCommand.scala)
> Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected 
> character ('' (code 65279 / 0xfeff)): expected a valid value (number, 
> String, array, object, 'true', 'false' or 'null')
>  at [Source: (String)"{"version":1,
>   "partitions":[
>{"topic": "test1", "partition": 0, "replicas": [1,2]},
>{"topic": "test2", "partition": 1, "replicas": [2,3]}
> ]}"; line: 1, column: 2]
>   at 
> com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1798)
>   at 
> com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:663)
>   at 
> com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:561)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1892)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:747)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4030)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2539)
>   at kafka.utils.Json$.kafka$utils$Json$$doParseFull(Json.scala:46)
>   at kafka.utils.Json$$anonfun$tryParseFull$1.apply(Json.scala:44)
>   at kafka.utils.Json$$anonfun$tryParseFull$1.apply(Json.scala:44)
>   at scala.util.Try$.apply(Try.scala:192)
>   at kafka.utils.Json$.tryParseFull(Json.scala:44)
>   at 
> kafka.admin.ReassignPartitionsCommand$.parsePartitionReassignmentData(ReassignPartitionsCommand.scala:241)
>   ... 5 more
> {noformat}
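> 
> A minimal sketch of the "propagate instead of swallow" idea (Java here for 
> illustration; the real code lives in the Scala {{Json}} helper shown above):
> {code:java}
> import com.fasterxml.jackson.core.JsonProcessingException;
> import com.fasterxml.jackson.databind.JsonNode;
> import com.fasterxml.jackson.databind.ObjectMapper;
> 
> public class JsonParseSketch {
>     private static final ObjectMapper MAPPER = new ObjectMapper();
> 
>     // Instead of mapping any JsonProcessingException to None/empty, surface
>     // Jackson's message, which already includes the offending character and
>     // its line/column (e.g. the BOM case shown above).
>     static JsonNode parseOrExplain(String input) {
>         try {
>             return MAPPER.readTree(input);
>         } catch (JsonProcessingException e) {
>             throw new IllegalArgumentException(
>                     "Partition reassignment JSON is invalid: " + e.getMessage(), e);
>         }
>     }
> }
> {code}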



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (KAFKA-13442) REST API endpoint for fetching a connector's config definition

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass resolved KAFKA-13442.
-
Resolution: Duplicate

> REST API endpoint for fetching a connector's config definition
> --
>
> Key: KAFKA-13442
> URL: https://issues.apache.org/jira/browse/KAFKA-13442
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 3.2.0
>Reporter: Viktor Somogyi-Vass
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> To enhance UI-based applications' ability to help users create new 
> connectors from default configurations, it would be very useful to have an 
> API that can fetch a connector type's configuration definition, which users 
> can then fill out and send back for validation before creating a new 
> connector from it.
> The API should be placed under {{connector-plugins}} and since 
> {{connector-plugins/\{connectorType\}/config/validate}} already exists, 
> {{connector-plugins/\{connectorType\}/config}} might be a good option for the 
> new API.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (KAFKA-13504) Retry connect internal topics' creation in case of InvalidReplicationFactorException

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-13504:
---

Assignee: Viktor Somogyi-Vass  (was: Andras Katona)

> Retry connect internal topics' creation in case of 
> InvalidReplicationFactorException
> 
>
> Key: KAFKA-13504
> URL: https://issues.apache.org/jira/browse/KAFKA-13504
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Reporter: Andras Katona
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> If the Kafka broker cluster and the Kafka Connect cluster are started 
> together and Connect tries to create its internal topics, there is a high 
> chance that the creation fails with InvalidReplicationFactorException.
> {noformat}
> ERROR org.apache.kafka.connect.runtime.distributed.DistributedHerder [Worker 
> clientId=connect-1, groupId=connect-cluster] Uncaught exception in herder 
> work thread, exiting: 
> org.apache.kafka.connect.errors.ConnectException: Error while attempting to 
> create/find topic(s) 'connect-offsets'
> ...
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication 
> factor: 3 larger than available brokers: 2.
> {noformat}
> Introducing a retry logic here would make Connect a bit more robust.
> New configurations:
> * offset.storage.topic.create.retries
> * offset.storage.topic.create.retry.backoff.ms
> * config.storage.topic.create.retries
> * config.storage.topic.create.retry.backoff.ms
> * status.storage.topic.create.retries
> * status.storage.topic.create.retry.backoff.ms



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (KAFKA-13452) MM2 creates invalid checkpoint when offset mapping is not available

2022-04-28 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass resolved KAFKA-13452.
-
Resolution: Duplicate

> MM2 creates invalid checkpoint when offset mapping is not available
> ---
>
> Key: KAFKA-13452
> URL: https://issues.apache.org/jira/browse/KAFKA-13452
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Reporter: Daniel Urban
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> MM2 checkpointing reads the offset-syncs topic to create offset mappings for 
> committed consumer group offsets. In some corner cases, it is possible that a 
> mapping is not available in offset-syncs - in that case, MM2 simply copies 
> the source offset, which might not be a valid offset in the replica topic at 
> all.
> One possible situation is if there is an empty topic in the source cluster 
> with a non-zero endoffset (e.g. retention already removed the records), and a 
> consumer group which has a committed offset set to the end offset. If 
> replication is configured to start replicating this topic, it will not have 
> an offset mapping available in offset-syncs (as the topic is empty), causing 
> MM2 to copy the source offset.
> This can cause issues when auto offset sync is enabled, as the consumer group 
> offset can be potentially set to a high number. MM2 never rewinds these 
> offsets, so even when there is a correct offset mapping available, the offset 
> will not be updated correctly.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (KAFKA-10586) Full support for distributed mode in dedicated MirrorMaker 2.0 clusters

2022-04-26 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-10586:
---

Assignee: Viktor Somogyi-Vass  (was: Daniel Urban)

> Full support for distributed mode in dedicated MirrorMaker 2.0 clusters
> ---
>
> Key: KAFKA-10586
> URL: https://issues.apache.org/jira/browse/KAFKA-10586
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Reporter: Daniel Urban
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> KIP-382 introduced MirrorMaker 2.0. Because of scoping issues, the dedicated 
> MirrorMaker 2.0 cluster does not utilize the Connect REST API. This means 
> that with specific workloads, the dedicated MM2 cluster can become unable to 
> react to dynamic topic and group filter changes.
> (This occurs when, after a rebalance operation, the leader node has no 
> MirrorSourceConnectorTasks. Because of this, the MirrorSourceConnector is 
> stopped on the leader, meaning it cannot detect config changes by itself. 
> Followers still running the connector can detect config changes, but they 
> cannot query the leader for config updates.)
> Besides the REST support, config provider references should be evaluated 
> lazily in connector configurations.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (KAFKA-13459) MM2 should be able to add the source offset to the record header

2022-04-26 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-13459:
---

Assignee: Viktor Somogyi-Vass

> MM2 should be able to add the source offset to the record header
> 
>
> Key: KAFKA-13459
> URL: https://issues.apache.org/jira/browse/KAFKA-13459
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Reporter: Daniel Urban
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> MM2 could add the source offset to the record header to help with diagnostics 
> in some use-cases.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (KAFKA-13659) MM2 should read all offset syncs at start up and should not set consumer offset higher than the end offset

2022-02-09 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-13659:
---

Assignee: Kanalas Vidor

> MM2 should read all offset syncs at start up and should not set consumer 
> offset higher than the end offset
> --
>
> Key: KAFKA-13659
> URL: https://issues.apache.org/jira/browse/KAFKA-13659
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Reporter: Kanalas Vidor
>Assignee: Kanalas Vidor
>Priority: Major
>
> - MirrorCheckpointTask uses OffsetSyncStore, and does not check whether 
> OffsetSyncStore managed to read to the "end" of the offset-syncs topic. 
> OffsetSyncStore should fetch the endoffset of the topic at startup, and set a 
> flag when it finally reaches the endoffset in consumption. 
> MirrorCheckpointTask.poll should wait for this flag to be true before doing 
> any in-memory updates and group offset management.
>  - MirrorCheckpointTask can create checkpoints which point into the "future" 
> - meaning it sometimes translates consumer offsets in a way that the target 
> offset is greater than the endoffset of the replica topic partition. 
> MirrorCheckpointTask should fetch the endoffsets of the affected topics, and 
> make sure that it does not try to set the consumer offset to anything higher 
> than the endoffset.
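> 
> A rough sketch of the two behaviours described above (class and field names 
> are made up, not MirrorMaker internals): remember the end offset of the 
> offset-syncs topic at startup and only report "ready" once consumption has 
> caught up, and clamp translated offsets to the replica's end offset:
> {code:java}
> import java.util.Collections;
> import java.util.Map;
> 
> import org.apache.kafka.clients.consumer.Consumer;
> import org.apache.kafka.common.TopicPartition;
> 
> public class OffsetSyncReadinessSketch {
>     private final long endOffsetAtStartup;
>     private long consumedUpTo = -1L;
> 
>     OffsetSyncReadinessSketch(Consumer<byte[], byte[]> consumer, TopicPartition syncsPartition) {
>         // Capture the end offset of the offset-syncs partition once, at startup.
>         Map<TopicPartition, Long> end = consumer.endOffsets(Collections.singleton(syncsPartition));
>         this.endOffsetAtStartup = end.getOrDefault(syncsPartition, 0L);
>     }
> 
>     void recordConsumed(long offset) {
>         consumedUpTo = offset;
>     }
> 
>     // MirrorCheckpointTask.poll would wait for this before touching group offsets.
>     boolean readyForCheckpointing() {
>         return consumedUpTo + 1 >= endOffsetAtStartup;
>     }
> 
>     // Never set a consumer group offset beyond the replica partition's end offset.
>     static long clampToEndOffset(long translatedOffset, long replicaEndOffset) {
>         return Math.min(translatedOffset, replicaEndOffset);
>     }
> }
> {code}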



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (KAFKA-13452) MM2 creates invalid checkpoint when offset mapping is not available

2021-11-12 Thread Viktor Somogyi-Vass (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Somogyi-Vass reassigned KAFKA-13452:
---

Assignee: Viktor Somogyi-Vass

> MM2 creates invalid checkpoint when offset mapping is not available
> ---
>
> Key: KAFKA-13452
> URL: https://issues.apache.org/jira/browse/KAFKA-13452
> Project: Kafka
>  Issue Type: Improvement
>  Components: mirrormaker
>Reporter: Daniel Urban
>Assignee: Viktor Somogyi-Vass
>Priority: Major
>
> MM2 checkpointing reads the offset-syncs topic to create offset mappings for 
> committed consumer group offsets. In some corner cases, it is possible that a 
> mapping is not available in offset-syncs - in that case, MM2 simply copies 
> the source offset, which might not be a valid offset in the replica topic at 
> all.
> One possible situation is if there is an empty topic in the source cluster 
> with a non-zero endoffset (e.g. retention already removed the records), and a 
> consumer group which has a committed offset set to the end offset. If 
> replication is configured to start replicating this topic, it will not have 
> an offset mapping available in offset-syncs (as the topic is empty), causing 
> MM2 to copy the source offset.
> This can cause issues when auto offset sync is enabled, as the consumer group 
> offset can be potentially set to a high number. MM2 never rewinds these 
> offsets, so even when there is a correct offset mapping available, the offset 
> will not be updated correctly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (KAFKA-13442) REST API endpoint for fetching a connector's config definition

2021-11-10 Thread Viktor Somogyi-Vass (Jira)
Viktor Somogyi-Vass created KAFKA-13442:
---

 Summary: REST API endpoint for fetching a connector's config 
definition
 Key: KAFKA-13442
 URL: https://issues.apache.org/jira/browse/KAFKA-13442
 Project: Kafka
  Issue Type: Improvement
  Components: KafkaConnect
Affects Versions: 3.2.0
Reporter: Viktor Somogyi-Vass
Assignee: Viktor Somogyi-Vass


To enhance UI-based applications' ability to help users create new connectors 
from default configurations, it would be very useful to have an API that can 
fetch a connector type's configuration definition, which users can then fill 
out and send back for validation before creating a new connector from it.

The API should be placed under {{connector-plugins}} and since 
{{connector-plugins/\{connectorType\}/config/validate}} already exists, 
{{connector-plugins/\{connectorType\}/config}} might be a good option for the 
new API.
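
A rough sketch of how a UI could use this (the GET endpoint is the one proposed 
above and does not exist yet; the validate endpoint is the existing one; the 
worker address and connector class are placeholders):

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectorConfigDefSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "http://localhost:8083";       // assumed Connect worker address
        String type = "FileStreamSinkConnector";     // illustrative connector plugin

        // Proposed: GET /connector-plugins/{connectorType}/config
        HttpRequest fetchDef = HttpRequest.newBuilder(
                URI.create(base + "/connector-plugins/" + type + "/config")).GET().build();
        System.out.println(client.send(fetchDef, HttpResponse.BodyHandlers.ofString()).body());

        // Existing: PUT /connector-plugins/{connectorType}/config/validate
        HttpRequest validate = HttpRequest.newBuilder(
                URI.create(base + "/connector-plugins/" + type + "/config/validate"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(
                        "{\"connector.class\":\"" + type + "\"}"))
                .build();
        System.out.println(client.send(validate, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}
{code}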



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (KAFKA-6668) Broker crashes on restart ,got a CorruptRecordException: Record size is smaller than minimum record overhead(14)

2021-10-11 Thread Viktor Somogyi-Vass (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427097#comment-17427097
 ] 

Viktor Somogyi-Vass commented on KAFKA-6668:


[~little brother ma], [~alchimie], [~borisvu] have any of you figured out the 
problem here? Was it related to a disk issue or something in Kafka?

> Broker crashes on restart ,got a CorruptRecordException: Record size is 
> smaller than minimum record overhead(14)
> 
>
> Key: KAFKA-6668
> URL: https://issues.apache.org/jira/browse/KAFKA-6668
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.11.0.1
> Environment: Linux version :
> 3.10.0-514.26.2.el7.x86_64 (mockbuild@cgslv5.buildsys213) (gcc version 4.8.5 
> 20150623 (Red Hat 4.8.5-11)
> docker version:
> Client:
>  Version:  1.12.6
>  API version:  1.24
>  Go version:   go1.7.5
>  Git commit:   ad3fef854be2172454902950240a9a9778d24345
>  Built:Mon Jan 15 22:01:14 2018
>  OS/Arch:  linux/amd64
>Reporter: little brother ma
>Priority: Major
>
> There is a Kafka cluster with a single broker, running in a Docker container. 
> Because the disk became full, the container crashes, gets restarted, and 
> crashes again...  
> Log when the disk is full:
>  
> {code:java}
> [2018-03-14 00:11:40,764] INFO Rolled new log segment for 'oem-debug-log-1' 
> in 1 ms. (kafka.log.Log) [2018-03-14 00:11:40,765] ERROR Uncaught exception 
> in scheduled task 'flush-log' (kafka.utils.KafkaScheduler) 
> java.io.IOException: I/O error at sun.nio.ch.FileDispatcherImpl.force0(Native 
> Method) at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76) at 
> sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:388) at 
> org.apache.kafka.common.record.FileRecords.flush(FileRecords.java:162) at 
> kafka.log.LogSegment$$anonfun$flush$1.apply$mcV$sp(LogSegment.scala:377) at 
> kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:376) at 
> kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:376) at 
> kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) at 
> kafka.log.LogSegment.flush(LogSegment.scala:376) at 
> kafka.log.Log$$anonfun$flush$2.apply(Log.scala:1312) at 
> kafka.log.Log$$anonfun$flush$2.apply(Log.scala:1311) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:891) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> kafka.log.Log.flush(Log.scala:1311) at 
> kafka.log.Log$$anonfun$roll$1.apply$mcV$sp(Log.scala:1283) at 
> kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110) 
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748) [2018-03-14 00:11:56,514] ERROR 
> [KafkaApi-7] Error when handling request 
> {replica_id=-1,max_wait_time=100,min_bytes=1,topics=[{topic=oem-debug- 
> log,partitions=[{partition=0,fetch_offset=0,max_bytes=1048576},{partition=1,fetch_offset=131382630,max_bytes=1048576}]}]}
>  (kafka.server.KafkaAp is) 
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
> than minimum record overhead (14).
> {code}
>  
>  
>  
> Log after the disk-full issue was resolved and Kafka was restarted:
>  
> {code:java}
> [2018-03-15 23:00:08,998] WARN Found a corrupted index file due to 
> requirement failed: Corrupt index found, index file 
> (/kafka/kafka-logs/__consumer
> _offsets-19/03396188.index) has non-zero size but the last offset 
> is 3396188 which is no larger than the base offset 3396188.}. deleting
>  /kafka/kafka-logs/__consumer_offsets-19/03396188.timeindex, 
> /kafka/kafka-logs/__consumer_offsets-19/03396188.index, and /ka
> fka/kafka-logs/__consumer_offsets-19/03396188.txnindex and 
> rebuilding index... (kafka.log.Log)
> [2018-03-15 23:00:08,999] INFO Loading producer state from snapshot file 
> '/kafka/kafka-logs/__consumer_offsets-19/03396188.snapshot' for
>  partition __consumer_offsets-19 (kafka.log.ProducerStateManager)
> [2018-03-15 23:00:09,242] INFO Recovering unflushed 
