[jira] [Commented] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800845#comment-17800845 ]

Ron Dagostino commented on KAFKA-15495:
---------------------------------------

Thanks, [~jsancio]. I've updated the title and description to make it clear this is a general problem rather than a KRaft-specific one, and I've indicated that it affects all released versions back to 1.0.0. I've also linked it to the ELR ticket at https://issues.apache.org/jira/browse/KAFKA-15332.

> Partition truncated when the only ISR member restarts with an empty disk
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-15495
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15495
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 2.0.0, 2.0.1, 2.1.0,
> 2.2.0, 2.1.1, 2.3.0, 2.2.1, 2.2.2, 2.4.0, 2.3.1, 2.5.0, 2.4.1, 2.6.0, 2.5.1,
> 2.7.0, 2.6.1, 2.8.0, 2.7.1, 2.6.2, 3.1.0, 2.6.3, 2.7.2, 2.8.1, 3.0.0, 3.0.1,
> 2.8.2, 3.2.0, 3.1.1, 3.3.0, 3.0.2, 3.1.2, 3.2.1, 3.4.0, 3.2.2, 3.2.3, 3.3.1,
> 3.3.2, 3.5.0, 3.4.1, 3.6.0, 3.5.1, 3.5.2, 3.6.1
>            Reporter: Ron Dagostino
>            Priority: Critical
>
> Assume a topic-partition has just a single leader replica in the ISR. Assume next that this replica goes offline. This replica's log will define the contents of that partition when the replica restarts, which is correct behavior. However, assume now that the replica has a disk failure, and we then replace the failed disk with a new, empty disk that we also format with the storage tool so it has the correct cluster ID. If we then restart the broker, the topic-partition will have no data in it, and any other replicas that might exist will truncate their logs to match, which results in data loss.
>
> See below for a step-by-step demo of how to reproduce this using KRaft (the issue impacts ZK-based implementations as well, but we supply only a KRaft-based reproduction here).
>
> Note that implementing Eligible Leader Replicas (https://issues.apache.org/jira/browse/KAFKA-15332) will resolve this issue.
>
> STEPS TO REPRODUCE:
>
> Create a single-broker cluster with a single controller. The standard files under config/kraft work well:
>
> bin/kafka-storage.sh random-uuid
> J8qXRwI-Qyi2G0guFTiuYw
>
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/controller.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker.properties
> bin/kafka-server-start.sh config/kraft/controller.properties
> bin/kafka-server-start.sh config/kraft/broker.properties
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 --partitions 1 --replication-factor 1
>
> #create __consumer_offsets topic
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 --from-beginning
> ^C
>
> #confirm that __consumer_offsets topic partitions are all created and on broker with node id 2
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe
>
> Now create 2 more brokers, with node IDs 11 and 12:
>
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 's/localhost:9092/localhost:9011/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > config/kraft/broker11.properties
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 's/localhost:9092/localhost:9012/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > config/kraft/broker12.properties
>
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker11.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker12.properties
> bin/kafka-server-start.sh config/kraft/broker11.properties
> bin/kafka-server-start.sh config/kraft/broker12.properties
>
> #create a topic with a single partition replicated on two brokers
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 --partitions 1 --replication-factor 2
>
> #reassign partitions onto brokers with node IDs 11 and 12
> echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], "version":1}' > /tmp/reassign.json
> bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /tmp/reassign.json --execute
> bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /tmp/reassign.json --verify
>
> #make preferred leader 11 the actual leader if it is not
> bin/kafka-leader-election.sh --bootstrap-server localhos
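The sed pipeline in the steps above can be sanity-checked without a running cluster. Below is a self-contained sketch against a minimal stand-in properties file; the real config/kraft/broker.properties contains many more keys, and the file here exists only for illustration:

```shell
# Build a minimal stand-in for config/kraft/broker.properties in a temp dir.
workdir=$(mktemp -d)
cat > "$workdir/broker.properties" <<'EOF'
node.id=2
listeners=PLAINTEXT://localhost:9092
advertised.listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kraft-broker-logs
EOF

# Apply the same three substitutions as the reproduce steps: new node id,
# new port, and a distinct log directory for broker 11.
sed 's/node.id=2/node.id=11/' "$workdir/broker.properties" \
  | sed 's/localhost:9092/localhost:9011/g' \
  | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' \
  > "$workdir/broker11.properties"

# Confirm each substitution landed.
grep -q 'node.id=11' "$workdir/broker11.properties" && echo "node.id ok"
grep -q 'localhost:9011' "$workdir/broker11.properties" && echo "listener ok"
grep -q 'log.dirs=/tmp/kraft-broker-logs11' "$workdir/broker11.properties" && echo "log.dirs ok"
```

The `#` delimiter in the third sed expression avoids escaping the slashes in the log-dir path.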
[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Dagostino updated KAFKA-15495:
----------------------------------
Description: Assume a topic-partition has just a single leader replica in the ISR. Assume next that this replica goes offline. This replica's log will define the contents of that partition when the replica restarts, which is correct behavior. However, assume now that the replica has a disk failure, and we then replace the failed disk with a new, empty disk that we also format with the storage tool so it has the correct cluster ID. If we then restart the broker, the topic-partition will have no data in it, and any other replicas that might exist will truncate their logs to match, which results in data loss. See below for a step-by-step demo of how to reproduce this using KRaft (the issue impacts ZK-based implementations as well, but we supply only a KRaft-based reproduction here). Note that implementing Eligible Leader Replicas (https://issues.apache.org/jira/browse/KAFKA-15332) will resolve this issue.
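The data-loss mechanism in the description can be modeled with plain files: the restarted leader reports an empty log, and the follower truncates its own copy to the leader's (zero) log end offset. This is a toy illustration of the failure mode, not Kafka's replica-fetcher code; the file names are hypothetical:

```shell
# Files stand in for the two replicas' partition logs.
dir=$(mktemp -d)
printf '1\n2\n3\n4\n5\n' > "$dir/leader.log"   # leader's log, five records
cp "$dir/leader.log" "$dir/follower.log"       # follower is fully caught up

# Disk replaced and reformatted: the leader restarts with an empty log.
: > "$dir/leader.log"

# The follower fetches, sees log end offset 0, and truncates to match.
leader_size=$(wc -c < "$dir/leader.log" | tr -d ' ')
truncate -s "$leader_size" "$dir/follower.log"

# Both copies are now empty: the records are gone from every replica.
wc -c < "$dir/follower.log"
```

The point of the model is that truncating to the leader's end offset is correct when the leader's log is authoritative, and destructive when the leader's disk was silently replaced.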
[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Dagostino updated KAFKA-15495:
----------------------------------
Affects Version/s: 2.2.1 2.3.0 2.1.1 2.2.0 2.1.0 2.0.1 2.0.0 1.1.1 1.1.0 1.0.2 1.0.1 1.0.0

> [KIP-858: Handle JBOD broker disk failure in KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] introduces the concept of a Disk UUID that we can use to solve this problem. Specifically, when the leader restarts with an empty (but correctly-formatted) disk, the actual UUID associated with the disk will be different. The controller will notice upon broker re-registration that its disk UUID differs from what was previously registered. Right now we have no way of detecting this situation, but the disk UUID gives us that capability.
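The detection KIP-858 enables can be sketched with hand-written meta.properties stand-ins. The `directory.id` key and file layout below follow KIP-858's design but are assumptions for illustration; the exact on-disk format in a given Kafka release may differ:

```shell
dir=$(mktemp -d)

# First format: the storage tool stamps the log dir with a random directory UUID.
# The UUID values here are placeholders, not real generated UUIDs.
cat > "$dir/meta.properties" <<EOF
cluster.id=J8qXRwI-Qyi2G0guFTiuYw
directory.id=uuid-before-disk-swap
EOF
registered_uuid=$(sed -n 's/^directory.id=//p' "$dir/meta.properties")  # what the controller has on record

# Disk replaced and re-formatted: same cluster.id, but a brand-new directory UUID.
cat > "$dir/meta.properties" <<EOF
cluster.id=J8qXRwI-Qyi2G0guFTiuYw
directory.id=uuid-after-disk-swap
EOF
current_uuid=$(sed -n 's/^directory.id=//p' "$dir/meta.properties")

# The mismatch is the signal the controller can act on at broker re-registration.
if [ "$current_uuid" != "$registered_uuid" ]; then
  echo "disk UUID changed: logs on this disk cannot be trusted to define the partition"
fi
```

Note that the cluster ID alone cannot distinguish the two states, which is exactly why the reproduce scenario succeeds today: both formats carry the same cluster.id.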
[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Dagostino updated KAFKA-15495:
----------------------------------
Affects Version/s: 2.7.2 2.6.3 3.1.0 2.6.2 2.7.1 2.8.0 2.6.1 2.7.0 2.5.1 2.6.0 2.4.1 2.5.0 2.3.1 2.4.0 2.2.2
[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Dagostino updated KAFKA-15495:
----------------------------------
Affects Version/s: 3.2.1 3.1.2 3.0.2 3.3.0 3.1.1 3.2.0 2.8.2 3.0.1 3.0.0 2.8.1
[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Dagostino updated KAFKA-15495:
----------------------------------
Affects Version/s: 3.6.1 3.5.2 3.5.0 3.3.1 3.2.3 3.2.2 3.4.0
[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15495: -- Summary: Partition truncated when the only ISR member restarts with an empty disk (was: KRaft partition truncated when the only ISR member restarts with an empty disk) > Partition truncated when the only ISR member restarts with an empty disk > > > Key: KAFKA-15495 > URL: https://issues.apache.org/jira/browse/KAFKA-15495 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.3.2, 3.4.1, 3.6.0, 3.5.1 >Reporter: Ron Dagostino >Priority: Critical > > Assume a topic-partition in KRaft has just a single leader replica in the > ISR. Assume next that this replica goes offline. This replica's log will > define the contents of that partition when the replica restarts, which is > correct behavior. However, assume now that the replica has a disk failure, > and we then replace the failed disk with a new, empty disk that we also > format with the storage tool so it has the correct cluster ID. If we then > restart the broker, the topic-partition will have no data in it, and any > other replicas that might exist will truncate their logs to match, which > results in data loss. See below for a step-by-step demo of how to reproduce > this. > [KIP-858: Handle JBOD broker disk failure in > KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] > introduces the concept of a Disk UUID that we can use to solve this problem. > Specifically, when the leader restarts with an empty (but > correctly-formatted) disk, the actual UUID associated with the disk will be > different. The controller will notice upon broker re-registration that its > disk UUID differs from what was previously registered. Right now we have no > way of detecting this situation, but the disk UUID gives us that capability. > STEPS TO REPRODUCE: > Create a single broker cluster with single controller. 
The standard files > under config/kraft work well: > bin/kafka-storage.sh random-uuid > J8qXRwI-Qyi2G0guFTiuYw > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/controller.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker.properties > bin/kafka-server-start.sh config/kraft/controller.properties > bin/kafka-server-start.sh config/kraft/broker.properties > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 > --partitions 1 --replication-factor 1 > #create __consumer-offsets topics > bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 > --from-beginning > ^C > #confirm that __consumer_offsets topic partitions are all created and on > broker with node id 2 > bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe > Now create 2 more brokers, with node IDs 11 and 12 > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed > 's/localhost:9092/localhost:9011/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > > config/kraft/broker11.properties > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed > 's/localhost:9092/localhost:9012/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > > config/kraft/broker12.properties > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12 > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker11.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker12.properties > bin/kafka-server-start.sh config/kraft/broker11.properties > bin/kafka-server-start.sh config/kraft/broker12.properties > #create a topic with a single partition replicated on two brokers > bin/kafka-topics.sh 
--bootstrap-server localhost:9092 --create --topic foo2 > --partitions 1 --replication-factor 2 > #reassign partitions onto brokers with Node IDs 11 and 12 > echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], > "version":1}' > /tmp/reassign.json > bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 > --reassignment-json-file /tmp/reassign.json --execute > bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 > --reassignment-json-file /tmp/reassign.json --verify > #make preferred leader 11 the actual leader if it is not > bin/kafka-leader-election.sh --bootstrap-server localhost:9092 > --all-topic-partitions --election-type preferred > #Confirm both brokers are in ISR and 11 is the leader > bin/kafka-topic
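The failure mode the steps above reproduce can be sketched in a few lines of Python. This is a deliberately simplified toy model of follower truncation, not Kafka's actual fetch/reconciliation protocol; all names are illustrative:

```python
def reconcile_follower(follower_log, leader_log):
    """Truncate the follower so it never extends past the leader's
    log end offset -- a toy model of follower/leader reconciliation."""
    return follower_log[:len(leader_log)]

# Broker 11 (leader, sole ISR member) and broker 12 (follower) hold 5 records.
leader = ["1", "2", "3", "4", "5"]
follower = ["1", "2", "3", "4", "5"]

# Broker 11's disk fails and is replaced with a freshly formatted, empty disk.
leader = []

# On restart, the empty leader still defines the partition contents, so the
# follower truncates everything it had: total data loss.
follower = reconcile_follower(follower, leader)
assert follower == []
```

The point of the sketch is that nothing in the reconciliation step distinguishes "the leader legitimately has an empty log" from "the leader lost its log to a disk replacement".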
[jira] [Updated] (KAFKA-15365) Broker-side replica management changes
[ https://issues.apache.org/jira/browse/KAFKA-15365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15365: -- Fix Version/s: 3.7.0 > Broker-side replica management changes > -- > > Key: KAFKA-15365 > URL: https://issues.apache.org/jira/browse/KAFKA-15365 > Project: Kafka > Issue Type: Sub-task >Reporter: Igor Soarez >Assignee: Omnia Ibrahim >Priority: Major > Fix For: 3.7.0 > > > On the broker side, process metadata changes to partition directories as the > broker catches up to metadata, as described in KIP-858 under "Replica > management". > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
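The broker-side behavior the ticket describes, folding partition-to-directory metadata changes into the broker's local view as it catches up on the metadata log, might look roughly like the following. The record shape and names here are invented for illustration; the real KIP-858 metadata records differ:

```python
def apply_assignments(local_view, records):
    """Fold a batch of hypothetical (topic, partition, dir_uuid) records
    into the broker's local partition -> directory map; when the same
    partition appears more than once, the newest record wins."""
    for topic, partition, dir_uuid in records:
        local_view[(topic, partition)] = dir_uuid
    return local_view

view = {}
apply_assignments(view, [("foo2", 0, "dirA"), ("foo1", 0, "dirB")])
apply_assignments(view, [("foo2", 0, "dirC")])  # later record reassigns foo2-0
assert view[("foo2", 0)] == "dirC"
assert view[("foo1", 0)] == "dirB"
```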
[jira] [Updated] (KAFKA-15358) QueuedReplicaToDirAssignments metric
[ https://issues.apache.org/jira/browse/KAFKA-15358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15358: -- Fix Version/s: 3.7.0 > QueuedReplicaToDirAssignments metric > > > Key: KAFKA-15358 > URL: https://issues.apache.org/jira/browse/KAFKA-15358 > Project: Kafka > Issue Type: Sub-task >Reporter: Igor Soarez >Assignee: Michael Westerby >Priority: Major > Fix For: 3.7.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
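Assuming the metric reports the number of replica-to-directory assignments a broker has queued but not yet propagated, a gauge of this kind can be sketched as below. The class and method names are illustrative, not the actual broker code:

```python
from collections import deque

class AssignmentQueue:
    """Hypothetical queue of pending replica -> directory assignments."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, topic_partition, dir_uuid):
        self._pending.append((topic_partition, dir_uuid))

    def drain(self, n):
        """Send up to n queued assignments (e.g. in one heartbeat)."""
        sent = []
        for _ in range(min(n, len(self._pending))):
            sent.append(self._pending.popleft())
        return sent

    def queued_replica_to_dir_assignments(self):
        """The value a gauge like QueuedReplicaToDirAssignments would report."""
        return len(self._pending)

q = AssignmentQueue()
q.enqueue(("foo2", 0), "dirA")
q.enqueue(("foo2", 1), "dirA")
assert q.queued_replica_to_dir_assignments() == 2
q.drain(1)
assert q.queued_replica_to_dir_assignments() == 1
```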
[jira] [Updated] (KAFKA-15471) Allow independently stop KRaft controllers or brokers
[ https://issues.apache.org/jira/browse/KAFKA-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15471: -- Fix Version/s: 3.7.0 > Allow independently stop KRaft controllers or brokers > - > > Key: KAFKA-15471 > URL: https://issues.apache.org/jira/browse/KAFKA-15471 > Project: Kafka > Issue Type: Improvement >Reporter: Hailey Ni >Assignee: Hailey Ni >Priority: Major > Fix For: 3.7.0 > > > Some users run KRaft controllers and brokers on the same machine (not > containerized, but through tarballs, etc). Prior to KRaft, when running > ZooKeeper and Kafka on the same machine, users could independently stop the > ZooKeeper node and Kafka broker since there were specific shell scripts for > each (zookeeper-server-stop and kafka-server-stop, respectively). > However in KRaft mode, they can't stop the KRaft controllers independently > from the Kafka brokers because there is just a single script that doesn't > distinguish between the two processes and signals both of them. We need to > provide a way for users to kill either controllers or brokers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
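The requested behavior amounts to selecting only the processes started with the matching configuration before signalling them. A hedged sketch of that selection logic, as a pure function over (pid, command line) pairs; this is an approximation for illustration, and the actual stop scripts match process arguments differently:

```python
def pids_to_stop(processes, role):
    """Pick the PIDs whose command line was started with the given role's
    properties file. role is 'controller' or 'broker'; matching on the
    config file name is an illustrative heuristic, not the real script."""
    needle = f"{role}.properties"
    return [pid for pid, cmdline in processes if needle in cmdline]

procs = [
    (101, "java kafka.Kafka config/kraft/controller.properties"),
    (102, "java kafka.Kafka config/kraft/broker.properties"),
]
assert pids_to_stop(procs, "controller") == [101]
assert pids_to_stop(procs, "broker") == [102]
```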
[jira] [Resolved] (KAFKA-15471) Allow independently stop KRaft controllers or brokers
[ https://issues.apache.org/jira/browse/KAFKA-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-15471. --- Resolution: Fixed > Allow independently stop KRaft controllers or brokers > - > > Key: KAFKA-15471 > URL: https://issues.apache.org/jira/browse/KAFKA-15471 > Project: Kafka > Issue Type: Improvement >Reporter: Hailey Ni >Assignee: Hailey Ni >Priority: Major > > Some users run KRaft controllers and brokers on the same machine (not > containerized, but through tarballs, etc). Prior to KRaft, when running > ZooKeeper and Kafka on the same machine, users could independently stop the > ZooKeeper node and Kafka broker since there were specific shell scripts for > each (zookeeper-server-stop and kafka-server-stop, respectively). > However in KRaft mode, they can't stop the KRaft controllers independently > from the Kafka brokers because there is just a single script that doesn't > distinguish between the two processes and signals both of them. We need to > provide a way for users to kill either controllers or brokers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15710) KRaft support in ServerShutdownTest
[ https://issues.apache.org/jira/browse/KAFKA-15710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782236#comment-17782236 ] Ron Dagostino commented on KAFKA-15710: --- The KRaft equivalent of `testCleanShutdownWithZkUnavailable` was indeed added via https://github.com/apache/kafka/pull/11606 (it is called `testCleanShutdownWithKRaftControllerUnavailable`). However, it appears there is no KRaft test equivalent of `testControllerShutdownDuringSend`. I am not sure if we need one, though. That test starts 2 ZK-based brokers, one of which is the controller. It has that controller accept the first request it receives (LeaderAndIsrRequest in the test) but only read a single byte and then block. Then it has the non-controller broker shut down using the same `ControllerChannelManager` instance. I think the idea is to confirm that shutdown still works while another request is being sent. Given that KRaft brokers don't use `ControllerChannelManager`, I suspect this test does not need a direct KRaft equivalent. > KRaft support in ServerShutdownTest > --- > > Key: KAFKA-15710 > URL: https://issues.apache.org/jira/browse/KAFKA-15710 > Project: Kafka > Issue Type: Task > Components: core >Reporter: Sameer Tejani >Priority: Minor > Labels: kraft, kraft-test, newbie > > The following tests in ServerShutdownTest in > core/src/test/scala/unit/kafka/server/ServerShutdownTest.scala need to be > updated to support KRaft > 192 : def testCleanShutdownWithZkUnavailable(quorum: String): Unit = { > 258 : def testControllerShutdownDuringSend(quorum: String): Unit = { > Scanned 324 lines. Found 5 KRaft tests out of 7 tests -- This message was sent by Atlassian Jira (v8.20.10#820010)
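The shape of the ZK test described above, a controller that reads one byte of a request and then blocks while the sender must still be able to shut down, can be mimicked with plain sockets to show why it is a shutdown-liveness test. This is a simplified stand-in, not the actual Scala test:

```python
import socket
import threading

def blocking_controller(server_sock, release):
    """Accept one connection, read a single byte of the request, then
    block without ever answering -- mimicking the test's stuck controller."""
    conn, _ = server_sock.accept()
    conn.recv(1)
    release.wait()
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
release = threading.Event()
t = threading.Thread(target=blocking_controller, args=(server, release))
t.start()

client = socket.socket()
client.connect(server.getsockname())
client.sendall(b"LeaderAndIsrRequest")  # request now stuck at the controller

# The sending side's "shutdown" must not hang on the in-flight request:
# here we time out instead of blocking forever on a response.
client.settimeout(0.1)
try:
    client.recv(1)
    shutdown_clean = False
except socket.timeout:
    shutdown_clean = True
client.close()
release.set()
t.join()
server.close()
assert shutdown_clean
```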
[jira] [Commented] (KAFKA-15591) Trogdor produce workload reports errors in KRaft mode
[ https://issues.apache.org/jira/browse/KAFKA-15591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774570#comment-17774570 ] Ron Dagostino commented on KAFKA-15591: --- > Is this caused by the fact that, in the KRaft protocol, Kafka doesn't elect leaders > immediately after a new topic is created but rather does that on demand after > receiving the first message on the topic? No, that is not correct. The leader for each partition is identified at the time the topic-partition is created. If the broker is responding that it does not know about that partition, then it could be the case that it has not replicated and acted upon the records in the metadata log that created the partition and identified it as the leader. The logs you pasted above show this happening between 2023-10-12 00:30:50,862 and 2023-10-12 00:30:50,876, a span of only 14 ms. If this were happening for seconds or longer then that would certainly be a problem, but 14 ms at first glance doesn't sound alarming. > Trogdor produce workload reports errors in KRaft mode > - > > Key: KAFKA-15591 > URL: https://issues.apache.org/jira/browse/KAFKA-15591 > Project: Kafka > Issue Type: Bug > Environment: Linux >Reporter: Xi Yang >Priority: Blocker > > The Kafka benchmark in the Dacapo Benchmark Suite uses Trogdor's exec > mode ([https://github.com/dacapobench/dacapobench/pull/224]) to test the > Kafka broker. > > I am trying to update the benchmark to use the KRaft protocol. We use a single > Kafka instance that acts as both controller and broker, following the guide in > Kafka README.md > (https://github.com/apache/kafka#running-a-kafka-broker-in-kraft-mode). > > However, the Trogdor produce workload > (tests/spec/simple_produce_bench.json) reports NOT_LEADER_OR_FOLLOWER > errors. The errors are gone after many retries. 
Is this caused by the fact that > in the KRaft protocol, Kafka doesn't elect leaders immediately after a new > topic is created but rather does that on demand after receiving the first message > on the topic? If this is the root cause, is there a way to ask Kafka to elect > the leader after creating the topic? > {code:java} > // code placeholder > ./bin/trogdor.sh agent -n node0 -c ./config/trogdor.conf --exec > ./tests/spec/simple_produce_bench.json > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/home/xyang/code/kafka/tools/build/dependant-libs-2.13.12/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xyang/code/kafka/trogdor/build/dependant-libs-2.13.12/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory] > Oct 12, 2023 12:30:50 AM org.glassfish.jersey.server.wadl.WadlFeature > configure > WARNING: JAXBContext implementation could not be found. WADL feature is > disabled. > Oct 12, 2023 12:30:50 AM org.glassfish.jersey.internal.inject.Providers > checkProviderRuntime > WARNING: A provider org.apache.kafka.trogdor.agent.AgentRestResource > registered in SERVER runtime does not implement any provider interfaces > applicable in the SERVER runtime. Due to constraint configuration problems > the provider org.apache.kafka.trogdor.agent.AgentRestResource will be ignored. 
> Waiting for completion of task:{ > "class" : "org.apache.kafka.trogdor.workload.ProduceBenchSpec", > "startMs" : 1697070650540, > "durationMs" : 1000, > "producerNode" : "node0", > "bootstrapServers" : "localhost:9092", > "targetMessagesPerSec" : 1, > "maxMessages" : 5, > "keyGenerator" : { > "type" : "sequential", > "size" : 4, > "startOffset" : 0 > }, > "valueGenerator" : { > "type" : "constant", > "size" : 512, > "value" : > "AAA=" > }, > "activeTopics" : { > "foo[1-3]" : { > "numPartitions" : 10, > "replicationFactor" : 1 > } > }, > "inactiveTopics" : { > "foo[4-5
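Ron's explanation above implies the error is retriable and short-lived: a client that backs off briefly while the broker catches up on the metadata log will succeed. A sketch of that retry behavior against a stubbed send; real Java producers handle this internally via their `retries` and `retry.backoff.ms` settings, and everything here is illustrative:

```python
import time

class NotLeaderOrFollower(Exception):
    """Stand-in for the retriable NOT_LEADER_OR_FOLLOWER broker error."""

def send_with_retry(send, retries=5, backoff_s=0.005):
    """Retry a send a few times, sleeping briefly between attempts to give
    metadata propagation (observed above at ~14 ms) time to complete."""
    for attempt in range(retries + 1):
        try:
            return send()
        except NotLeaderOrFollower:
            if attempt == retries:
                raise
            time.sleep(backoff_s)

# Stub broker: leadership becomes visible only on the third attempt.
attempts = {"n": 0}
def stub_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise NotLeaderOrFollower()
    return "ack"

assert send_with_retry(stub_send) == "ack"
assert attempts["n"] == 3
```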
[jira] [Commented] (KAFKA-15495) KRaft partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768787#comment-17768787 ] Ron Dagostino commented on KAFKA-15495: --- Yes. The Disk UUID from JBOD is not needed for this because ELR takes care of it as described above. It feels right to have the Disk UUID anyway since it is a simple concept that is easy to understand and reason about, and we can consider it as part of "defense in depth" -- it doesn't hurt to get multiple signals. Also we will probably need the Disk UUID for Raft, which doesn't use ISR, and ELR won't help there. This ticket will likely get resolved when [Eligible Leader Replicas|https://issues.apache.org/jira/browse/KAFKA-15332] gets resolved. > KRaft partition truncated when the only ISR member restarts with an empty disk > -- > > Key: KAFKA-15495 > URL: https://issues.apache.org/jira/browse/KAFKA-15495 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.3.2, 3.4.1, 3.6.0, 3.5.1 >Reporter: Ron Dagostino >Priority: Critical > > Assume a topic-partition in KRaft has just a single leader replica in the > ISR. Assume next that this replica goes offline. This replica's log will > define the contents of that partition when the replica restarts, which is > correct behavior. However, assume now that the replica has a disk failure, > and we then replace the failed disk with a new, empty disk that we also > format with the storage tool so it has the correct cluster ID. If we then > restart the broker, the topic-partition will have no data in it, and any > other replicas that might exist will truncate their logs to match, which > results in data loss. See below for a step-by-step demo of how to reproduce > this. > [KIP-858: Handle JBOD broker disk failure in > KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] > introduces the concept of a Disk UUID that we can use to solve this problem. 
> Specifically, when the leader restarts with an empty (but > correctly-formatted) disk, the actual UUID associated with the disk will be > different. The controller will notice upon broker re-registration that its > disk UUID differs from what was previously registered. Right now we have no > way of detecting this situation, but the disk UUID gives us that capability. > STEPS TO REPRODUCE: > Create a single broker cluster with single controller. The standard files > under config/kraft work well: > bin/kafka-storage.sh random-uuid > J8qXRwI-Qyi2G0guFTiuYw > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/controller.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker.properties > bin/kafka-server-start.sh config/kraft/controller.properties > bin/kafka-server-start.sh config/kraft/broker.properties > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 > --partitions 1 --replication-factor 1 > #create __consumer-offsets topics > bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 > --from-beginning > ^C > #confirm that __consumer_offsets topic partitions are all created and on > broker with node id 2 > bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe > Now create 2 more brokers, with node IDs 11 and 12 > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed > 's/localhost:9092/localhost:9011/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > > config/kraft/broker11.properties > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed > 's/localhost:9092/localhost:9012/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > > config/kraft/broker12.properties > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs11 
/tmp/kraft-broker-logs12 > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker11.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker12.properties > bin/kafka-server-start.sh config/kraft/broker11.properties > bin/kafka-server-start.sh config/kraft/broker12.properties > #create a topic with a single partition replicated on two brokers > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 > --partitions 1 --replication-factor 2 > #reassign partitions onto brokers with Node IDs 11 and 12 > echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], > "version":1}' > /tmp/reassign.json > bin/kafka-reassign-partitions.sh --boots
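The detection the Disk UUID enables, as described in the ticket above, is that the controller can compare the directory UUID in a broker's re-registration against what it previously recorded; a mismatch reveals a replaced, re-formatted disk. A sketch with invented names and shapes, not the actual controller code:

```python
registered = {}  # broker_id -> disk UUID seen at the last registration

def register_broker(broker_id, disk_uuid):
    """Record the broker's disk UUID; return True if this registration
    reveals a fresh (replaced) disk, i.e. the UUID changed since last time."""
    previous = registered.get(broker_id)
    registered[broker_id] = disk_uuid
    return previous is not None and previous != disk_uuid

assert register_broker(11, "uuid-original") is False  # first registration
assert register_broker(11, "uuid-original") is False  # clean restart, same disk
# Disk replaced and re-formatted: the UUID differs, so the controller can
# react instead of treating the empty replica as the authoritative leader.
assert register_broker(11, "uuid-after-replacement") is True
```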
[jira] [Commented] (KAFKA-15495) KRaft partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768674#comment-17768674 ] Ron Dagostino commented on KAFKA-15495: --- Thanks, Ismael. Yes, I read [the KIP|https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas] and [Jack's blog post about it|https://jack-vanlightly.com/blog/2023/8/17/kafka-kip-966-fixing-the-last-replica-standing-issue] last night after I posted this, and I have asked some folks who have been involved in that discussion if the broker epoch communicated in the broker registration request via the clean shutdown file might also serve to give the controller a signal that the disk is new. I'll comment more here soon. > KRaft partition truncated when the only ISR member restarts with an empty disk > -- > > Key: KAFKA-15495 > URL: https://issues.apache.org/jira/browse/KAFKA-15495 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.3.2, 3.4.1, 3.6.0, 3.5.1 >Reporter: Ron Dagostino >Priority: Critical > > Assume a topic-partition in KRaft has just a single leader replica in the > ISR. Assume next that this replica goes offline. This replica's log will > define the contents of that partition when the replica restarts, which is > correct behavior. However, assume now that the replica has a disk failure, > and we then replace the failed disk with a new, empty disk that we also > format with the storage tool so it has the correct cluster ID. If we then > restart the broker, the topic-partition will have no data in it, and any > other replicas that might exist will truncate their logs to match, which > results in data loss. See below for a step-by-step demo of how to reproduce > this. > [KIP-858: Handle JBOD broker disk failure in > KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] > introduces the concept of a Disk UUID that we can use to solve this problem. 
> Specifically, when the leader restarts with an empty (but > correctly-formatted) disk, the actual UUID associated with the disk will be > different. The controller will notice upon broker re-registration that its > disk UUID differs from what was previously registered. Right now we have no > way of detecting this situation, but the disk UUID gives us that capability. > STEPS TO REPRODUCE: > Create a single broker cluster with single controller. The standard files > under config/kraft work well: > bin/kafka-storage.sh random-uuid > J8qXRwI-Qyi2G0guFTiuYw > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/controller.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker.properties > bin/kafka-server-start.sh config/kraft/controller.properties > bin/kafka-server-start.sh config/kraft/broker.properties > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 > --partitions 1 --replication-factor 1 > #create __consumer-offsets topics > bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 > --from-beginning > ^C > #confirm that __consumer_offsets topic partitions are all created and on > broker with node id 2 > bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe > Now create 2 more brokers, with node IDs 11 and 12 > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed > 's/localhost:9092/localhost:9011/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > > config/kraft/broker11.properties > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed > 's/localhost:9092/localhost:9012/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > > config/kraft/broker12.properties > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs11 
/tmp/kraft-broker-logs12 > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker11.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker12.properties > bin/kafka-server-start.sh config/kraft/broker11.properties > bin/kafka-server-start.sh config/kraft/broker12.properties > #create a topic with a single partition replicated on two brokers > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 > --partitions 1 --replication-factor 2 > #reassign partitions onto brokers with Node IDs 11 and 12 > echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], > "version":1}' > /tmp/reassign.json > bin/kafka-reassign-partitions.sh --boots
[jira] [Updated] (KAFKA-15495) KRaft partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15495: -- Description: Assume a topic-partition in KRaft has just a single leader replica in the ISR. Assume next that this replica goes offline. This replica's log will define the contents of that partition when the replica restarts, which is correct behavior. However, assume now that the replica has a disk failure, and we then replace the failed disk with a new, empty disk that we also format with the storage tool so it has the correct cluster ID. If we then restart the broker, the topic-partition will have no data in it, and any other replicas that might exist will truncate their logs to match, which results in data loss. See below for a step-by-step demo of how to reproduce this. [KIP-858: Handle JBOD broker disk failure in KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] introduces the concept of a Disk UUID that we can use to solve this problem. Specifically, when the leader restarts with an empty (but correctly-formatted) disk, the actual UUID associated with the disk will be different. The controller will notice upon broker re-registration that its disk UUID differs from what was previously registered. Right now we have no way of detecting this situation, but the disk UUID gives us that capability. STEPS TO REPRODUCE: Create a single broker cluster with single controller. 
The standard files under config/kraft work well: bin/kafka-storage.sh random-uuid J8qXRwI-Qyi2G0guFTiuYw #ensure we start clean /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/controller.properties bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker.properties bin/kafka-server-start.sh config/kraft/controller.properties bin/kafka-server-start.sh config/kraft/broker.properties bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 --partitions 1 --replication-factor 1 #create __consumer-offsets topics bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 --from-beginning ^C #confirm that __consumer_offsets topic partitions are all created and on broker with node id 2 bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe Now create 2 more brokers, with node IDs 11 and 12 cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 's/localhost:9092/localhost:9011/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > config/kraft/broker11.properties cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 's/localhost:9092/localhost:9012/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > config/kraft/broker12.properties #ensure we start clean /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12 bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker11.properties bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker12.properties bin/kafka-server-start.sh config/kraft/broker11.properties bin/kafka-server-start.sh config/kraft/broker12.properties #create a topic with a single partition replicated on two brokers bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 --partitions 1 
--replication-factor 2 #reassign partitions onto brokers with Node IDs 11 and 12 echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], "version":1}' > /tmp/reassign.json bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /tmp/reassign.json --execute bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /tmp/reassign.json --verify #make preferred leader 11 the actual leader if it is not bin/kafka-leader-election.sh --bootstrap-server localhost:9092 --all-topic-partitions --election-type preferred #Confirm both brokers are in ISR and 11 is the leader bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic foo2 Topic: foo2 TopicId: pbbQZ23UQ5mQqmZpoSRCLQ PartitionCount: 1 ReplicationFactor: 2 Configs: segment.bytes=1073741824 Topic: foo2 Partition: 0 Leader: 11 Replicas: 11,12 Isr: 12,11 #Emit some messages to the topic bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic foo2 1 2 3 4 5 ^C #confirm we see the messages bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo2 --from-beginning 1 2 3 4 5 ^C #Again confirm both brokers are in ISR, leader is 11 bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic foo2 Topic: foo2 TopicId: pbbQZ23UQ5mQqmZpoSRCL
[jira] [Updated] (KAFKA-15495) KRaft partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15495: -- Description: Assume a topic-partition in KRaft has just a single leader replica in the ISR. Assume next that this replica goes offline. This replica's log will define the contents of that partition when the replica restarts, which is correct behavior. However, assume now that the replica has a disk failure, and we then replace the failed disk with a new, empty disk that we also format with the storage tool so it has the correct cluster ID. If we then restart the broker, the topic-partition will have no data in it, and any other replicas that might exist will truncate their logs to match, which results in data loss. See below for a step-by-step demo of how to reproduce this. [KIP-858: Handle JBOD broker disk failure in KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] introduces the concept of a Disk UUID that we can use to solve this problem. Specifically, when the leader restarts with an empty (but correctly-formatted) disk, the actual UUID associated with the disk will be different. The controller will notice upon broker re-registration that its disk UUID differs from what was previously registered. Right now we have no way of detecting this situation, but the disk UUID gives us that capability. STEPS TO REPRODUCE: Create a single broker cluster with single controller. 
The standard files under config/kraft work well: bin/kafka-storage.sh random-uuid J8qXRwI-Qyi2G0guFTiuYw # ensure we start clean /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/controller.properties bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker.properties bin/kafka-server-start.sh config/kraft/controller.properties bin/kafka-server-start.sh config/kraft/broker.properties bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 --partitions 1 --replication-factor 1 # create __consumer-offsets topics bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 --from-beginning ^C # confirm that __consumer_offsets topic partitions are all created and on broker with node id 2 bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe Now create 2 more brokers, with node IDs 11 and 12 cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 's/localhost:9092/localhost:9011/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > config/kraft/broker11.properties cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 's/localhost:9092/localhost:9012/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > config/kraft/broker12.properties # ensure we start clean /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12 bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker11.properties bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker12.properties bin/kafka-server-start.sh config/kraft/broker11.properties bin/kafka-server-start.sh config/kraft/broker12.properties # create a topic with a single partition replicated on two brokers bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 --partitions 1 
--replication-factor 2 # reassign partitions onto brokers with Node IDs 11 and 12 cat > /tmp/reassign.json
[jira] [Updated] (KAFKA-15495) KRaft partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15495: -- Description: Assume a topic-partition in KRaft has just a single leader replica in the ISR. Assume next that this replica goes offline. This replica's log will define the contents of that partition when the replica restarts, which is correct behavior. However, assume now that the replica has a disk failure, and we then replace the failed disk with a new, empty disk that we also format with the storage tool so it has the correct cluster ID. If we then restart the broker, the topic-partition will have no data in it, and any other replicas that might exist will truncate their logs to match, which results in data loss. See below for a step-by-step demo of how to reproduce this. [KIP-858: Handle JBOD broker disk failure in KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] introduces the concept of a Disk UUID that we can use to solve this problem. Specifically, when the leader restarts with an empty (but correctly-formatted) disk, the actual UUID associated with the disk will be different. The controller will notice upon broker re-registration that its disk UUID differs from what was previously registered. Right now we have no way of detecting this situation, but the disk UUID gives us that capability. STEPS TO REPRODUCE: Create a single broker cluster with single controller. 
The standard files under config/kraft work well: bin/kafka-storage.sh random-uuid J8qXRwI-Qyi2G0guFTiuYw # ensure we start clean /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/controller.properties bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker.properties bin/kafka-server-start.sh config/kraft/controller.properties bin/kafka-server-start.sh config/kraft/broker.properties bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 --partitions 1 --replication-factor 1 # create the __consumer_offsets topic bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 --from-beginning ^C # confirm that __consumer_offsets topic partitions are all created and on broker with node id 2 bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe Now create 2 more brokers, with node IDs 11 and 12 cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 's/localhost:9092/localhost:9011/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > config/kraft/broker11.properties cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 's/localhost:9092/localhost:9012/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > config/kraft/broker12.properties # ensure we start clean /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12 bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker11.properties bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker12.properties bin/kafka-server-start.sh config/kraft/broker11.properties bin/kafka-server-start.sh config/kraft/broker12.properties # create a topic with a single partition replicated on two brokers bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 --partitions 1 
--replication-factor 2

# reassign partitions onto brokers with node IDs 11 and 12
cat > /tmp/reassign.json
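The reassignment step above is truncated here, but the file it builds appears in full later in this thread; a minimal sketch that generates the same /tmp/reassign.json could look like this (the JSON shape is exactly what kafka-reassign-partitions.sh expects):

```python
import json

# Move foo2 partition 0 onto the brokers with node IDs 11 and 12,
# matching the reassignment file shown later in this thread.
reassignment = {
    "version": 1,
    "partitions": [
        {"topic": "foo2", "partition": 0, "replicas": [11, 12]},
    ],
}

with open("/tmp/reassign.json", "w") as f:
    json.dump(reassignment, f)
```

The file is then passed to kafka-reassign-partitions.sh with --execute and later --verify, as in the steps below.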
[jira] [Updated] (KAFKA-15495) KRaft partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15495: -- Summary: KRaft partition truncated when the only ISR member restarts with an empty disk (was: KRaft partition truncated when the only ISR member restarts with and empty disk) > KRaft partition truncated when the only ISR member restarts with an empty disk > -- > > Key: KAFKA-15495 > URL: https://issues.apache.org/jira/browse/KAFKA-15495 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.3.2, 3.4.1, 3.6.0, 3.5.1 >Reporter: Ron Dagostino >Priority: Critical > > Assume a topic-partition in KRaft has just a single leader replica in the > ISR. Assume next that this replica goes offline. This replica's log will > define the contents of that partition when the replica restarts, which is > correct behavior. However, assume now that the replica has a disk failure, > and we then replace the failed disk with a new, empty disk that we also > format with the storage tool so it has the correct cluster ID. If we then > restart the broker, the topic-partition will have no data in it, and any > other replicas that might exist will truncate their logs to match. See below > for a step-by-step demo of how to reproduce this. > [KIP-858: Handle JBOD broker disk failure in > KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] > introduces the concept of a Disk UUID that we can use to solve this problem. > Specifically, when the leader restarts with an empty (but > correctly-formatted) disk, the actual UUID associated with the disk will be > different. The controller will notice upon broker re-registration that its > disk UUID differs from what was previously registered. Right now we have no > way of detecting this situation, but the disk UUID gives us that capability. > STEPS TO REPRODUCE: > Create a single broker cluster with single controller. 
The standard files under config/kraft work well:
> bin/kafka-storage.sh random-uuid
> J8qXRwI-Qyi2G0guFTiuYw
> # ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/controller.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker.properties
> bin/kafka-server-start.sh config/kraft/controller.properties
> bin/kafka-server-start.sh config/kraft/broker.properties
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 --partitions 1 --replication-factor 1
> # create the __consumer_offsets topic
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 --from-beginning
> ^C
> # confirm that __consumer_offsets topic partitions are all created and on broker with node id 2
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe
> Now create 2 more brokers, with node IDs 11 and 12
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 's/localhost:9092/localhost:9011/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > config/kraft/broker11.properties
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 's/localhost:9092/localhost:9012/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > config/kraft/broker12.properties
> # ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker11.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker12.properties
> bin/kafka-server-start.sh config/kraft/broker11.properties
> bin/kafka-server-start.sh config/kraft/broker12.properties
> # create a topic with a single partition replicated on two brokers
> bin/kafka-topics.sh
--bootstrap-server localhost:9092 --create --topic foo2 --partitions 1 --replication-factor 2
> # reassign partitions onto brokers with node IDs 11 and 12
> cat > /tmp/reassign.json <<DONE
> {"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], "version":1}
> DONE
> bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /tmp/reassign.json --execute
> bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /tmp/reassign.json --verify
> # make preferred leader 11 the actual leader if it is not
> bin/kafka-leader-election.sh --bootstrap-server localhost:9092 --all-topic-partitions --election-type preferred
> # Confirm both brokers are in ISR and 11 is the leader
> bin/kafka-topic
[jira] [Created] (KAFKA-15495) KRaft partition truncated when the only ISR member restarts with and empty disk
Ron Dagostino created KAFKA-15495: - Summary: KRaft partition truncated when the only ISR member restarts with and empty disk Key: KAFKA-15495 URL: https://issues.apache.org/jira/browse/KAFKA-15495 Project: Kafka Issue Type: Bug Affects Versions: 3.5.1, 3.4.1, 3.3.2, 3.6.0 Reporter: Ron Dagostino Assume a topic-partition in KRaft has just a single leader replica in the ISR. Assume next that this replica goes offline. This replica's log will define the contents of that partition when the replica restarts, which is correct behavior. However, assume now that the replica has a disk failure, and we then replace the failed disk with a new, empty disk that we also format with the storage tool so it has the correct cluster ID. If we then restart the broker, the topic-partition will have no data in it, and any other replicas that might exist will truncate their logs to match. See below for a step-by-step demo of how to reproduce this. [KIP-858: Handle JBOD broker disk failure in KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] introduces the concept of a Disk UUID that we can use to solve this problem. Specifically, when the leader restarts with an empty (but correctly-formatted) disk, the actual UUID associated with the disk will be different. The controller will notice upon broker re-registration that its disk UUID differs from what was previously registered. Right now we have no way of detecting this situation, but the disk UUID gives us that capability. STEPS TO REPRODUCE: Create a single broker cluster with single controller. 
The standard files under config/kraft work well:

bin/kafka-storage.sh random-uuid
J8qXRwI-Qyi2G0guFTiuYw

# ensure we start clean
/bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs
bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/controller.properties
bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker.properties
bin/kafka-server-start.sh config/kraft/controller.properties
bin/kafka-server-start.sh config/kraft/broker.properties
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 --partitions 1 --replication-factor 1

# create the __consumer_offsets topic
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 --from-beginning
^C

# confirm that __consumer_offsets topic partitions are all created and on broker with node id 2
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe

Now create 2 more brokers, with node IDs 11 and 12

cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 's/localhost:9092/localhost:9011/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > config/kraft/broker11.properties
cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 's/localhost:9092/localhost:9012/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > config/kraft/broker12.properties

# ensure we start clean
/bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12
bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker11.properties
bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker12.properties
bin/kafka-server-start.sh config/kraft/broker11.properties
bin/kafka-server-start.sh config/kraft/broker12.properties

# create a topic with a single partition replicated on two brokers
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 --partitions 1
--replication-factor 2

# reassign partitions onto brokers with node IDs 11 and 12
cat > /tmp/reassign.json <
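The disk-UUID check described in the ticket can be sketched as follows; the registry dict, the function name, and the UUID strings are all illustrative stand-ins, not the actual KIP-858 controller API:

```python
# Hypothetical controller-side registry: broker id -> disk (directory) UUID
# recorded at the broker's previous registration.
registered_disk_uuids = {}

def disk_changed_on_reregistration(broker_id, disk_uuid):
    """Record the broker's disk UUID; return True if it differs from the one
    seen at the previous registration, i.e. the broker came back with a new
    (empty but correctly formatted) disk whose log cannot be trusted to
    define the partition contents."""
    previous = registered_disk_uuids.get(broker_id)
    registered_disk_uuids[broker_id] = disk_uuid
    return previous is not None and previous != disk_uuid
```

A first registration records the UUID without flagging anything, and a restart with the same disk is likewise fine; only a restart with a freshly formatted disk (hence a new UUID) trips the check, which is exactly the situation the ticket says is currently undetectable.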
[jira] [Resolved] (KAFKA-15219) Support delegation tokens in KRaft
[ https://issues.apache.org/jira/browse/KAFKA-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-15219. --- Fix Version/s: 3.6.0 Resolution: Fixed > Support delegation tokens in KRaft > -- > > Key: KAFKA-15219 > URL: https://issues.apache.org/jira/browse/KAFKA-15219 > Project: Kafka > Issue Type: Improvement >Affects Versions: 3.6.0 >Reporter: Viktor Somogyi-Vass >Assignee: Proven Provenzano >Priority: Critical > Fix For: 3.6.0 > > > Delegation tokens were created in KIP-48 and improved in KIP-373. KIP-900 paved the way in KRaft by adding SCRAM support, but delegation tokens still don't work with KRaft. > There are multiple issues: > - TokenManager would still try to create tokens in ZooKeeper. Instead, we should forward admin requests to the controller, which would store them in the metadata similarly to SCRAM. We probably won't need new protocols, just enveloping similar to other existing controller requests. > - TokenManager should run on controller nodes only (or in mixed mode). > - Integration tests will need to be adapted as well and parameterized with ZooKeeper/KRaft. > - Documentation needs to be improved to factor in KRaft. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15098) KRaft migration does not proceed and broker dies if authorizer.class.name is set
[ https://issues.apache.org/jira/browse/KAFKA-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15098: -- Description: [ERROR] 2023-06-16 20:14:14,298 [main] kafka.Kafka$ - Exiting Kafka due to fatal exception java.lang.IllegalArgumentException: requirement failed: ZooKeeper migration does not yet support authorizers. Remove authorizer.class.name before performing a migration. was: java.lang.IllegalArgumentException: requirement failed: ZooKeeper migration does not yet support authorizers. Remove authorizer.class.name before performing a migration. > KRaft migration does not proceed and broker dies if authorizer.class.name is > set > > > Key: KAFKA-15098 > URL: https://issues.apache.org/jira/browse/KAFKA-15098 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 3.5.0 >Reporter: Ron Dagostino >Assignee: David Arthur >Priority: Blocker > > [ERROR] 2023-06-16 20:14:14,298 [main] kafka.Kafka$ - Exiting Kafka due to > fatal exception > java.lang.IllegalArgumentException: requirement failed: ZooKeeper migration > does not yet support authorizers. Remove authorizer.class.name before > performing a migration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15098) KRaft migration does not proceed and broker dies if authorizer.class.name is set
Ron Dagostino created KAFKA-15098: - Summary: KRaft migration does not proceed and broker dies if authorizer.class.name is set Key: KAFKA-15098 URL: https://issues.apache.org/jira/browse/KAFKA-15098 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 3.5.0 Reporter: Ron Dagostino Assignee: David Arthur java.lang.IllegalArgumentException: requirement failed: ZooKeeper migration does not yet support authorizers. Remove authorizer.class.name before performing a migration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15039) Reduce logging level to trace in PartitionChangeBuilder.tryElection()
Ron Dagostino created KAFKA-15039: - Summary: Reduce logging level to trace in PartitionChangeBuilder.tryElection() Key: KAFKA-15039 URL: https://issues.apache.org/jira/browse/KAFKA-15039 Project: Kafka Issue Type: Improvement Components: kraft Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.6.0 A CPU profile in a large cluster showed PartitionChangeBuilder.tryElection() taking significant CPU due to logging. Decrease the logging statements in that method from debug level to trace to mitigate the impact of this CPU hog under normal operations. -- This message was sent by Atlassian Jira (v8.20.10#820010)
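The change described above follows the usual pattern of keeping hot-path logging below the enabled level; a generic Python sketch of that pattern, not the actual Scala code in PartitionChangeBuilder (the function and its election rule are illustrative):

```python
import logging

# Python has no built-in TRACE level, so 5 (below DEBUG=10) stands in.
TRACE = 5
logging.addLevelName(TRACE, "TRACE")
logging.basicConfig(level=logging.INFO)  # typical production level
logger = logging.getLogger("PartitionChangeBuilder")

def try_election(partition, candidates):
    # Guarded so the per-partition message costs almost nothing unless
    # TRACE is explicitly enabled.
    if logger.isEnabledFor(TRACE):
        logger.log(TRACE, "tryElection for %s among %s", partition, candidates)
    return candidates[0] if candidates else None
```

The isEnabledFor guard also avoids building the log arguments when the level is off, which is the CPU cost the profile flagged.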
[jira] [Updated] (KAFKA-14887) ZK session timeout can cause broker to shutdown
[ https://issues.apache.org/jira/browse/KAFKA-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14887: -- Fix Version/s: 2.7.3 3.2.4 3.1.3 3.0.3 3.4.1 3.3.3 2.8.3 > ZK session timeout can cause broker to shutdown > --- > > Key: KAFKA-14887 > URL: https://issues.apache.org/jira/browse/KAFKA-14887 > Project: Kafka > Issue Type: Improvement >Affects Versions: 2.7.0, 2.8.0, 2.7.1, 3.1.0, 2.7.2, 2.8.1, 3.0.0, 3.0.1, > 2.8.2, 3.2.0, 3.1.1, 3.3.0, 3.0.2, 3.1.2, 3.2.1, 3.4.0, 3.2.2, 3.2.3, 3.3.1, > 3.3.2 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > Fix For: 2.7.3, 3.2.4, 3.1.3, 3.0.3, 3.5.0, 3.4.1, 3.3.3, 2.8.3 > > > We have the following code in FinalizedFeatureChangeListener.scala which will > exit regardless of the type of exception that is thrown when trying to > process feature changes: > case e: Exception => { > error("Failed to process feature ZK node change event. The broker > will eventually exit.", e) > throw new FatalExitError(1) > } > The issue here is that this does not distinguish between exceptions caused by > an inability to process a feature change and an exception caused by a > ZooKeeper session timeout. We want to shut the broker down for the former > case, but we do NOT want to shut the broker down in the latter case; the > ZooKeeper session will eventually be reestablished, and we can continue > processing at that time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14887) ZK session timeout can cause broker to shutdown
[ https://issues.apache.org/jira/browse/KAFKA-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14887: -- Fix Version/s: 3.5.0 > ZK session timeout can cause broker to shutdown > --- > > Key: KAFKA-14887 > URL: https://issues.apache.org/jira/browse/KAFKA-14887 > Project: Kafka > Issue Type: Improvement >Affects Versions: 2.7.0, 2.8.0, 2.7.1, 3.1.0, 2.7.2, 2.8.1, 3.0.0, 3.0.1, > 2.8.2, 3.2.0, 3.1.1, 3.3.0, 3.0.2, 3.1.2, 3.2.1, 3.4.0, 3.2.2, 3.2.3, 3.3.1, > 3.3.2 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > Fix For: 3.5.0 > > > We have the following code in FinalizedFeatureChangeListener.scala which will > exit regardless of the type of exception that is thrown when trying to > process feature changes: > case e: Exception => { > error("Failed to process feature ZK node change event. The broker > will eventually exit.", e) > throw new FatalExitError(1) > } > The issue here is that this does not distinguish between exceptions caused by > an inability to process a feature change and an exception caused by a > ZooKeeper session timeout. We want to shut the broker down for the former > case, but we do NOT want to shut the broker down in the latter case; the > ZooKeeper session will eventually be reestablished, and we can continue > processing at that time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14735) Improve KRaft metadata image change performance at high topic counts
[ https://issues.apache.org/jira/browse/KAFKA-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14735. --- Resolution: Fixed > Improve KRaft metadata image change performance at high topic counts > > > Key: KAFKA-14735 > URL: https://issues.apache.org/jira/browse/KAFKA-14735 > Project: Kafka > Issue Type: Improvement > Components: kraft >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > Fix For: 3.6.0 > > > Performance of KRaft metadata image changes is currently O(<# of topics in > cluster>). This means the amount of time it takes to create just a *single* > topic scales linearly with the number of topics in the entire cluster. This > impacts both controllers and brokers because both use the metadata image to > represent the KRaft metadata log. The performance of these changes should > scale with the number of topics being changed -- so creating a single topic > should perform similarly regardless of the number of topics in the cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010)
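The O(# of topics) cost comes from rebuilding the whole topics map on every image change; the fix direction is structural sharing, so the work scales with what changed. A toy sketch of that idea (TopicsImage and its methods are illustrative names, and Kafka's actual implementation uses proper persistent collections rather than ChainMap):

```python
from collections import ChainMap

class TopicsImage:
    """Immutable snapshot of topic metadata with cheap single-topic updates."""

    def __init__(self, topics):
        self._topics = topics

    def with_topic(self, name, metadata):
        # Layer the delta over the previous snapshot instead of copying
        # every topic: O(1) in the number of topics in the cluster.
        return TopicsImage(ChainMap({name: metadata}, self._topics))

    def get(self, name):
        return self._topics.get(name)
```

A long ChainMap chain would slow lookups, so a real implementation compacts or uses a hash-array-mapped trie; the sketch only shows why an update no longer needs to touch every topic.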
[jira] [Resolved] (KAFKA-14890) Kafka initiates shutdown due to connectivity problem with Zookeeper and FatalExitError from ChangeNotificationProcessorThread
[ https://issues.apache.org/jira/browse/KAFKA-14890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14890. --- Resolution: Duplicate Duplicate of https://issues.apache.org/jira/browse/KAFKA-14887 > Kafka initiates shutdown due to connectivity problem with Zookeeper and > FatalExitError from ChangeNotificationProcessorThread > - > > Key: KAFKA-14890 > URL: https://issues.apache.org/jira/browse/KAFKA-14890 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 3.3.2 >Reporter: Denis Razuvaev >Priority: Major > > Hello, > We have faced several times the deadlock in Kafka, the similar issue is - > https://issues.apache.org/jira/browse/KAFKA-13544 > The question - is it expected behavior that Kafka decided to shut down due to > connectivity problems with Zookeeper? Seems like it is related to the > inability to read data from */feature* Zk node and the > _ZooKeeperClientExpiredException_ thrown from _ZooKeeperClient_ class. This > exception is thrown and it is caught only in catch block of _doWork()_ method > in {_}ChangeNotificationProcessorThread{_}, and it leads to > {_}FatalExitError{_}. > This problem with shutdown is reproduced in the new versions of Kafka (which > already have fix regarding deadlock from 13544). > It is hard to write a synthetic test to reproduce problem, but it can be > reproduced locally via debug mode with the following steps: > 1) Start Zookeeper and start Kafka in debug mode. > 2) Emulate connectivity problem between Kafka and Zookeeper, for example > connection can be closed via Netcrusher library. > 3) Put a breakpoint in _updateLatestOrThrow()_ method in > _FeatureCacheUpdater_ class, before > _zkClient.getDataAndVersion(featureZkNodePath)_ line execution. > 4) Restore connection between Kafka and Zookeeper after session expiration. > Kafka execution should be stopped on the breakpoint. 
> 5) Resume execution until Kafka starts to execute line > _zooKeeperClient.handleRequests(remainingRequests)_ in > _retryRequestsUntilConnected_ method in _KafkaZkClient_ class. > 6) Again emulate a connectivity problem between Kafka and Zookeeper and wait > until the session expires. > 7) Restore the connection between Kafka and Zookeeper. > 8) Kafka begins the shutdown process, due to: > _ERROR [feature-zk-node-event-process-thread]: Failed to process feature ZK > node change event. The broker will eventually exit. > (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread)_ > > In a real environment, this can be caused by network problems and periodic > disconnection and reconnection to Zookeeper within a short time period. > I started a mail thread at > [https://lists.apache.org/thread/gbk4scwd8g7mg2tfsokzj5tjgrjrb9dw] regarding > this problem, but got no answers. > To me it seems like a defect, because Kafka initiates shutdown after the > connection between Kafka and Zookeeper is restored, and it should be fixed. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-14887) ZK session timeout can cause broker to shutdown
[ https://issues.apache.org/jira/browse/KAFKA-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino reassigned KAFKA-14887: - Assignee: Ron Dagostino > ZK session timeout can cause broker to shutdown > --- > > Key: KAFKA-14887 > URL: https://issues.apache.org/jira/browse/KAFKA-14887 > Project: Kafka > Issue Type: Improvement >Affects Versions: 3.3.2, 3.3.1, 3.2.3, 3.2.2, 3.4.0, 3.2.1, 3.1.2, 3.0.2, > 3.3.0, 3.1.1, 3.2.0, 2.8.2, 3.0.1, 3.0.0, 2.8.1, 2.7.2, 3.1.0, 2.7.1, 2.8.0, > 2.7.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > > We have the following code in FinalizedFeatureChangeListener.scala which will > exit regardless of the type of exception that is thrown when trying to > process feature changes: > case e: Exception => { > error("Failed to process feature ZK node change event. The broker > will eventually exit.", e) > throw new FatalExitError(1) > } > The issue here is that this does not distinguish between exceptions caused by > an inability to process a feature change and an exception caused by a > ZooKeeper session timeout. We want to shut the broker down for the former > case, but we do NOT want to shut the broker down in the latter case; the > ZooKeeper session will eventually be reestablished, and we can continue > processing at that time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14887) ZK session timeout can cause broker to shutdown
Ron Dagostino created KAFKA-14887: - Summary: ZK session timeout can cause broker to shutdown Key: KAFKA-14887 URL: https://issues.apache.org/jira/browse/KAFKA-14887 Project: Kafka Issue Type: Improvement Affects Versions: 3.3.2, 3.3.1, 3.2.3, 3.2.2, 3.4.0, 3.2.1, 3.1.2, 3.0.2, 3.3.0, 3.1.1, 3.2.0, 2.8.2, 3.0.1, 3.0.0, 2.8.1, 2.7.2, 3.1.0, 2.7.1, 2.8.0, 2.7.0 Reporter: Ron Dagostino We have the following code in FinalizedFeatureChangeListener.scala which will exit regardless of the type of exception that is thrown when trying to process feature changes: case e: Exception => { error("Failed to process feature ZK node change event. The broker will eventually exit.", e) throw new FatalExitError(1) } The issue here is that this does not distinguish between exceptions caused by an inability to process a feature change and an exception caused by a ZooKeeper session timeout. We want to shut the broker down for the former case, but we do NOT want to shut the broker down in the latter case; the ZooKeeper session will eventually be reestablished, and we can continue processing at that time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14351) Implement controller mutation quotas in KRaft
[ https://issues.apache.org/jira/browse/KAFKA-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14351. --- Fix Version/s: 3.5.0 Resolution: Fixed > Implement controller mutation quotas in KRaft > - > > Key: KAFKA-14351 > URL: https://issues.apache.org/jira/browse/KAFKA-14351 > Project: Kafka > Issue Type: Improvement >Reporter: Colin McCabe >Assignee: Ron Dagostino >Priority: Major > Labels: kip-500 > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14735) Improve KRaft metadata image change performance at high topic counts
Ron Dagostino created KAFKA-14735: - Summary: Improve KRaft metadata image change performance at high topic counts Key: KAFKA-14735 URL: https://issues.apache.org/jira/browse/KAFKA-14735 Project: Kafka Issue Type: Improvement Components: kraft Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.5.0 Performance of KRaft metadata image changes is currently O(<# of topics in cluster>). This means the amount of time it takes to create just a *single* topic scales linearly with the number of topics in the entire cluster. This impacts both controllers and brokers because both use the metadata image to represent the KRaft metadata log. The performance of these changes should scale with the number of topics being changed -- so creating a single topic should perform similarly regardless of the number of topics in the cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-14731) Upgrade ZooKeeper to 3.6.4
[ https://issues.apache.org/jira/browse/KAFKA-14731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690531#comment-17690531 ] Ron Dagostino edited comment on KAFKA-14731 at 2/17/23 6:39 PM: Fixes in 3.6.4: https://issues.apache.org/jira/browse/ZOOKEEPER-4654?jql=project%20%3D%20ZOOKEEPER%20AND%20fixVersion%20%3D%203.6.4 was (Author: rndgstn): Fixes in 3.6.4: https://issues.apache.org/jira/browse/ZOOKEEPER-4476?jql=project%20%3D%20ZOOKEEPER%20AND%20fixVersion%20%3D%203.6.4 > Upgrade ZooKeeper to 3.6.4 > -- > > Key: KAFKA-14731 > URL: https://issues.apache.org/jira/browse/KAFKA-14731 > Project: Kafka > Issue Type: Task >Affects Versions: 3.0.2, 3.1.2, 3.4.0, 3.2.3, 3.3.2, 3.5.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > Fix For: 3.2.4, 3.1.3, 3.0.3, 3.5.0, 3.4.1, 3.3.3 > > > We have https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-14661 > opened to upgrade ZooKeeper from 3.6.3 to 3.8.1, and that will likely be > actioned in time for 3.5.0. But in the meantime, ZooKeeper 3.6.4 has been > released, so we should take the patch version bump in trunk now and also > apply the bump to the next patch releases of 3.0, 3.1, 3.2, 3.3, and 3.4. > Note that KAFKA-14661 should *not* be applied to branches prior to trunk (and > presumably 3.5). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14731) Upgrade ZooKeeper to 3.6.4
[ https://issues.apache.org/jira/browse/KAFKA-14731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690531#comment-17690531 ] Ron Dagostino commented on KAFKA-14731: --- Fixes in 3.6.4: https://issues.apache.org/jira/browse/ZOOKEEPER-4476?jql=project%20%3D%20ZOOKEEPER%20AND%20fixVersion%20%3D%203.6.4 > Upgrade ZooKeeper to 3.6.4 > -- > > Key: KAFKA-14731 > URL: https://issues.apache.org/jira/browse/KAFKA-14731 > Project: Kafka > Issue Type: Task >Affects Versions: 3.0.2, 3.1.2, 3.4.0, 3.2.3, 3.3.2, 3.5.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > Fix For: 3.2.4, 3.1.3, 3.0.3, 3.5.0, 3.4.1, 3.3.3 > > > We have https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-14661 > opened to upgrade ZooKeeper from 3.6.3 to 3.8.1, and that will likely be > actioned in time for 3.5.0. But in the meantime, ZooKeeper 3.6.4 has been > released, so we should take the patch version bump in trunk now and also > apply the bump to the next patch releases of 3.0, 3.1, 3.2, 3.3, and 3.4. > Note that KAFKA-14661 should *not* be applied to branches prior to trunk (and > presumably 3.5). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14731) Upgrade ZooKeeper to 3.6.4
Ron Dagostino created KAFKA-14731: - Summary: Upgrade ZooKeeper to 3.6.4 Key: KAFKA-14731 URL: https://issues.apache.org/jira/browse/KAFKA-14731 Project: Kafka Issue Type: Task Affects Versions: 3.3.2, 3.2.3, 3.4.0, 3.1.2, 3.0.2, 3.5.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.2.4, 3.1.3, 3.0.3, 3.5.0, 3.4.1, 3.3.3 We have https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-14661 opened to upgrade ZooKeeper from 3.6.3 to 3.8.1, and that will likely be actioned in time for 3.5.0. But in the meantime, ZooKeeper 3.6.4 has been released, so we should take the patch version bump in trunk now and also apply the bump to the next patch releases of 3.0, 3.1, 3.2, 3.3, and 3.4. Note that KAFKA-14661 should *not* be applied to branches prior to trunk (and presumably 3.5). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14661) Upgrade Zookeeper to 3.8.1
[ https://issues.apache.org/jira/browse/KAFKA-14661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14661: -- Fix Version/s: (was: 3.4.1) (was: 3.3.3) > Upgrade Zookeeper to 3.8.1 > --- > > Key: KAFKA-14661 > URL: https://issues.apache.org/jira/browse/KAFKA-14661 > Project: Kafka > Issue Type: Improvement > Components: packaging >Reporter: Divij Vaidya >Assignee: Christo Lolov >Priority: Blocker > Fix For: 3.5.0 > > > Current Zk version (3.6.x) supported by Apache Kafka has been EOL since > December 2022 [1] > Users of Kafka are facing regulatory hurdles because of using a dependency > which is EOL, hence, I would suggest to upgrade this in all upcoming releases > (including patch releases of 3.3.x and 3.4.x versions). > Some things to consider while upgrading (as pointed by [~ijuma] at [2]): > # If we upgrade the zk server to 3.8.1, what is the impact on the zk > clients. That is, what's the earliest zk client version that is supported by > the 3.8.x server? > # We need to ensure there are no regressions (particularly on the stability > front) when it comes to this upgrade. It would be good for someone to stress > test the system a bit with the new version and check if all works well. > [1] [https://zookeeper.apache.org/releases.html] > [2][https://github.com/apache/kafka/pull/12620#issuecomment-1409028650] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14711) kafaka-metadata-quorum.sh does not honor --command-config
[ https://issues.apache.org/jira/browse/KAFKA-14711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14711. --- Resolution: Fixed > kafaka-metadata-quorum.sh does not honor --command-config > - > > Key: KAFKA-14711 > URL: https://issues.apache.org/jira/browse/KAFKA-14711 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 3.4.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Critical > Fix For: 3.4.1 > > > https://github.com/apache/kafka/pull/12951 accidentally eliminated support > for the `--command-config` option in the `kafka-metadata-quorum.sh` command. > This was an undetected regression in the 3.4.0 release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14711) kafaka-metadata-quorum.sh does not honor --command-config
Ron Dagostino created KAFKA-14711: - Summary: kafaka-metadata-quorum.sh does not honor --command-config Key: KAFKA-14711 URL: https://issues.apache.org/jira/browse/KAFKA-14711 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 3.4.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.4.1 https://github.com/apache/kafka/pull/12951 accidentally eliminated support for the `--command-config` option in the `kafka-metadata-quorum.sh` command. This was an undetected regression in the 3.4.0 release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-14351) Implement controller mutation quotas in KRaft
[ https://issues.apache.org/jira/browse/KAFKA-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino reassigned KAFKA-14351: - Assignee: Ron Dagostino > Implement controller mutation quotas in KRaft > - > > Key: KAFKA-14351 > URL: https://issues.apache.org/jira/browse/KAFKA-14351 > Project: Kafka > Issue Type: Improvement >Reporter: Colin McCabe >Assignee: Ron Dagostino >Priority: Major > Labels: kip-500 > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14392: -- Fix Version/s: 3.4.0 (was: 3.5.0) > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement > Components: kraft >Affects Versions: 3.3.0, 3.3.1 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor > Fix For: 3.4.0, 3.3.2 > > > KRaft brokers maintain their liveness in the cluster by sending > BROKER_HEARTBEAT requests to the active controller; the active controller > fences a broker if it doesn't receive a heartbeat request from that broker > within the period defined by `broker.session.timeout.ms`. The broker should > use a request timeout for its BROKER_HEARTBEAT requests that is no larger > than the session timeout used by the controller; a larger request timeout > creates the possibility that, upon controller failover, the broker fails to > cancel an in-flight heartbeat request in time to heartbeat to the new > controller and maintain an uninterrupted session in the cluster. In other > words, a failure of the active controller could result in under-replicated > (or under-min ISR) partitions simply due to a delay in brokers heartbeating > to the new controller. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14392: -- Affects Version/s: (was: 3.4.0) > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement > Components: kraft >Affects Versions: 3.3.0, 3.3.1 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor > Fix For: 3.3.2, 3.5.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14392: -- Affects Version/s: (was: 3.3.2) > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement > Components: kraft >Affects Versions: 3.3.0, 3.4.0, 3.3.1 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor > Fix For: 3.3.2, 3.5.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14392: -- Fix Version/s: 3.3.2 > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement > Components: kraft >Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor > Fix For: 3.3.2, 3.5.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14392. --- Fix Version/s: 3.5.0 Resolution: Fixed > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement > Components: kraft >Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor > Fix For: 3.5.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14392: -- Affects Version/s: 3.3.2 > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement > Components: kraft >Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14392: -- Affects Version/s: 3.4.0 > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement > Components: kraft >Affects Versions: 3.3.0, 3.4.0, 3.3.1 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14392: -- Component/s: kraft > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement > Components: kraft >Affects Versions: 3.3.0, 3.3.1 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14392: -- Affects Version/s: (was: 3.4.0) > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement >Affects Versions: 3.3.0, 3.3.1 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14392: -- Affects Version/s: 3.3.1 3.3.0 3.4.0 > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement >Affects Versions: 3.3.0, 3.4.0, 3.3.1 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14394) BrokerToControllerChannelManager has 2 separate timeouts
[ https://issues.apache.org/jira/browse/KAFKA-14394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637404#comment-17637404 ] Ron Dagostino commented on KAFKA-14394: --- As per https://github.com/apache/kafka/pull/9564#discussion_r527326231, we do want infinite retries when the broker sends AlterPartition requests. Closing this issue as the code is correct. > BrokerToControllerChannelManager has 2 separate timeouts > > > Key: KAFKA-14394 > URL: https://issues.apache.org/jira/browse/KAFKA-14394 > Project: Kafka > Issue Type: Task >Reporter: Ron Dagostino >Priority: Major > > BrokerToControllerChannelManager uses `config.controllerSocketTimeoutMs` as > its default `networkClientRetryTimeoutMs` in general, but it also accepts a > second `retryTimeoutMs` value -- and there is exactly one place where this > second timeout is used: within BrokerToControllerRequestThread. Is this > second, separate timeout actually necessary, or is it a bug (in which case > the two timeouts should be the same)? Closely related is the case of > AlterPartitionManager, which passes Long.MAX_VALUE as the retryTimeoutMs value > when it instantiates its BrokerToControllerChannelManager. Is > this Long.MAX_VALUE correct, when in fact `config.controllerSocketTimeoutMs` > is being used as the other timeout? > This is related to > https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-14392 and the > associated PR, https://github.com/apache/kafka/pull/12856 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14394) BrokerToControllerChannelManager has 2 separate timeouts
[ https://issues.apache.org/jira/browse/KAFKA-14394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14394. --- Resolution: Not A Problem > BrokerToControllerChannelManager has 2 separate timeouts > > > Key: KAFKA-14394 > URL: https://issues.apache.org/jira/browse/KAFKA-14394 > Project: Kafka > Issue Type: Task >Reporter: Ron Dagostino >Priority: Major -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14394) BrokerToControllerChannelManager has 2 separate timeouts
Ron Dagostino created KAFKA-14394: - Summary: BrokerToControllerChannelManager has 2 separate timeouts Key: KAFKA-14394 URL: https://issues.apache.org/jira/browse/KAFKA-14394 Project: Kafka Issue Type: Task Reporter: Ron Dagostino BrokerToControllerChannelManager uses `config.controllerSocketTimeoutMs` as its default `networkClientRetryTimeoutMs` in general, but it also accepts a second `retryTimeoutMs` value -- and there is exactly one place where this second timeout is used: within BrokerToControllerRequestThread. Is this second, separate timeout actually necessary, or is it a bug (in which case the two timeouts should be the same)? Closely related is the case of AlterPartitionManager, which passes Long.MAX_VALUE as the retryTimeoutMs value when it instantiates its BrokerToControllerChannelManager. Is this Long.MAX_VALUE correct, when in fact `config.controllerSocketTimeoutMs` is being used as the other timeout? This is related to https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-14392 and the associated PR, https://github.com/apache/kafka/pull/12856 -- This message was sent by Atlassian Jira (v8.20.10#820010)
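The two timeouts in question can be sketched as follows. This is a simplified Python model of the pattern (not Kafka's actual Scala code): a per-request network timeout bounds each attempt, while a separate retry timeout bounds the whole retry sequence; passing an infinite retry timeout, as AlterPartitionManager effectively does with Long.MAX_VALUE, means "retry until success".

```python
import time

def send_with_retries(attempt_fn, request_timeout_ms, retry_timeout_ms):
    """Retry attempt_fn until it succeeds or retry_timeout_ms elapses.

    request_timeout_ms bounds each individual attempt; retry_timeout_ms
    bounds the overall sequence. An infinite retry_timeout_ms models the
    Long.MAX_VALUE that AlterPartitionManager passes in.
    """
    deadline = time.monotonic() + retry_timeout_ms / 1000.0
    while True:
        if attempt_fn(request_timeout_ms):
            return True
        if time.monotonic() >= deadline:
            return False

attempts = []
def flaky(timeout_ms):
    # Simulated request that fails twice, then succeeds.
    attempts.append(timeout_ms)
    return len(attempts) >= 3

ok = send_with_retries(flaky, request_timeout_ms=30_000,
                       retry_timeout_ms=float("inf"))
```

With an infinite retry timeout the request is simply re-sent until it succeeds, which is the desired behavior for AlterPartition as noted in the closing comment above.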
[jira] [Commented] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634471#comment-17634471 ] Ron Dagostino commented on KAFKA-14392: --- One possibility is to continue to use `controller.socket.timeout.ms` as is currently being done but then update the documentation to make this clear -- unfortunately the default value for `controller.socket.timeout.ms` is 30 seconds, whereas the default value for `broker.session.timeout.ms` is 9 seconds. Another possibility is to use the value passed into the Broker-to-Controller channel manager. For the broker's heartbeat thread, this is `broker.heartbeat.interval.ms`, which defaults to 2 seconds. The latter seems better -- it requires no change to any configs and better reflects the desire on the broker side, which is basically to cancel the request if it doesn't succeed within the heartbeat interval we are using and simply try again. > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor -- This message was sent by Atlassian Jira (v8.20.10#820010)
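The choice discussed in the comment above can be sketched numerically, using the default values it cites (30s socket timeout, 9s session timeout, 2s heartbeat interval). This is an illustrative Python model, not Kafka's actual code:

```python
# Defaults quoted in the comment above (all in milliseconds).
DEFAULTS = {
    "controller.socket.timeout.ms": 30_000,
    "broker.session.timeout.ms": 9_000,
    "broker.heartbeat.interval.ms": 2_000,
}

def heartbeat_request_timeout_ms(cfg):
    """Pick the request timeout for BROKER_HEARTBEAT requests.

    Using the heartbeat interval means a stale in-flight heartbeat is
    abandoned well within the controller's session timeout, so the broker
    can re-heartbeat to a newly elected controller without being fenced.
    """
    chosen = cfg["broker.heartbeat.interval.ms"]
    # The invariant the ticket title asks for: the request timeout must
    # not exceed the controller-side session timeout.
    assert chosen <= cfg["broker.session.timeout.ms"]
    return chosen

timeout = heartbeat_request_timeout_ms(DEFAULTS)
```

Note that the old behavior (using `controller.socket.timeout.ms`, 30s by default) violates the invariant, since 30s exceeds the 9s default session timeout.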
[jira] [Updated] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14392: -- Summary: KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms (was: KRaft should comment controller.socket.timeout.ms <= broker.session.timeout.ms) > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14392) KRaft should comment controller.socket.timeout.ms <= broker.session.timeout.ms
Ron Dagostino created KAFKA-14392: - Summary: KRaft should comment controller.socket.timeout.ms <= broker.session.timeout.ms Key: KAFKA-14392 URL: https://issues.apache.org/jira/browse/KAFKA-14392 Project: Kafka Issue Type: Improvement Reporter: Ron Dagostino Assignee: Ron Dagostino KRaft brokers maintain their liveness in the cluster by sending BROKER_HEARTBEAT requests to the active controller; the active controller fences a broker if it doesn't receive a heartbeat request from that broker within the period defined by `broker.session.timeout.ms`. The broker should use a request timeout for its BROKER_HEARTBEAT requests that is no larger than the session timeout used by the controller; a larger request timeout creates the possibility that, upon controller failover, the broker fails to cancel an in-flight heartbeat request in time to heartbeat to the new controller and maintain an uninterrupted session in the cluster. In other words, a failure of the active controller could result in under-replicated (or under-min ISR) partitions simply due to a delay in brokers heartbeating to the new controller. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14371) quorum-state file contains empty/unused clusterId field
Ron Dagostino created KAFKA-14371: - Summary: quorum-state file contains empty/unused clusterId field Key: KAFKA-14371 URL: https://issues.apache.org/jira/browse/KAFKA-14371 Project: Kafka Issue Type: Improvement Reporter: Ron Dagostino The KRaft controller's quorum-state file `$LOG_DIR/__cluster_metadata-0/quorum-state` contains an empty clusterId value. This value is never non-empty, and it is never used after it is written and then subsequently read. This is a cosmetic issue; it would be best if this value did not exist there. The cluster ID already exists in the `$LOG_DIR/meta.properties` file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
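For illustration, the quorum-state file described above is a small JSON document; the cosmetic issue is the always-empty `clusterId` field. The field names and values below are illustrative (reconstructed from memory, not quoted from the ticket):

```
{"clusterId":"","leaderId":1,"leaderEpoch":2,"votedId":-1,"appliedOffset":0,"currentVoters":[{"voterId":1}],"data_version":0}
```

As the ticket notes, the authoritative copy of the cluster ID lives in `$LOG_DIR/meta.properties`, so the empty field here is redundant.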
[jira] [Created] (KAFKA-14195) Fix KRaft AlterConfig policy usage for Legacy/Full case
Ron Dagostino created KAFKA-14195: - Summary: Fix KRaft AlterConfig policy usage for Legacy/Full case Key: KAFKA-14195 URL: https://issues.apache.org/jira/browse/KAFKA-14195 Project: Kafka Issue Type: Bug Affects Versions: 3.3 Reporter: Ron Dagostino Assignee: Ron Dagostino The fix for https://issues.apache.org/jira/browse/KAFKA-14039 adjusted the invocation of the alter configs policy check in KRaft to match the behavior in ZooKeeper, which is to only provide the configs that were explicitly sent in the request. While the code was correct for the incremental alter configs case, the code actually included the implicit deletions for the legacy/non-incremental alter configs case, and those implicit deletions are not included in the ZooKeeper-based invocation. The implicit deletions should not be passed in the legacy case. -- This message was sent by Atlassian Jira (v8.20.10#820010)
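The legacy-vs-incremental distinction described above can be sketched in a few lines. This is an illustrative Python model, not Kafka's actual code: a legacy (full) AlterConfigs request replaces the entire config, so any existing key absent from the request is implicitly deleted, whereas an incremental request touches only the keys it names. Per the fix, the policy check should see only the explicitly requested configs in both cases.

```python
def implicit_deletions(existing, requested):
    """Keys removed by a legacy (full) alter because they were not re-specified."""
    return set(existing) - set(requested)

# Hypothetical topic config state and request for illustration.
existing = {"retention.ms": "604800000", "cleanup.policy": "delete"}
requested = {"retention.ms": "86400000"}

deleted = implicit_deletions(existing, requested)

# The bug: the legacy path passed requested + implicit deletions to the
# policy. The fix: pass only the explicitly requested configs, matching
# the ZooKeeper-based behavior.
policy_input_buggy = dict(requested, **{k: None for k in deleted})
policy_input_fixed = dict(requested)
```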
[jira] [Resolved] (KAFKA-14051) KRaft remote controllers do not create metrics reporters
[ https://issues.apache.org/jira/browse/KAFKA-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14051. --- Resolution: Fixed > KRaft remote controllers do not create metrics reporters > > > Key: KAFKA-14051 > URL: https://issues.apache.org/jira/browse/KAFKA-14051 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 3.3 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > > KRaft remote controllers (KRaft nodes with the configuration value > process.roles=controller) do not create the configured metrics reporters > defined by the configuration key metric.reporters. The reason is that > KRaft remote controllers are not wired up for dynamic config changes, and the > creation of the configured metric reporters actually happens during the > wiring up of the broker for dynamic reconfiguration, in the invocation of > DynamicBrokerConfig.addReconfigurables(KafkaBroker). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14105) Remove quorum.all_non_upgrade for system tests
Ron Dagostino created KAFKA-14105: - Summary: Remove quorum.all_non_upgrade for system tests Key: KAFKA-14105 URL: https://issues.apache.org/jira/browse/KAFKA-14105 Project: Kafka Issue Type: Task Components: kraft, system tests Reporter: Ron Dagostino We defined `all_non_upgrade = [zk, remote_kraft]` in `quorum.py` to encapsulate the quorum(s) that we want system tests to generally run with when they are unrelated to upgrading. The idea was that we would just annotate tests with that and then we would be able to change the definition of it as we move through and beyond the KRaft bridge release. But it is confusing, and search-and-replace is cheap -- especially if we are only doing it once or twice over the course of the project. So we should eliminate the definition of `quorum.all_non_upgrade` (which was intended to be mutable over the course of the project) in favor of something like `zk_and_remote_kraft`, which will forever list ZK and REMOTE_KRAFT. -- This message was sent by Atlassian Jira (v8.20.10#820010)
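The rename proposed above can be sketched directly (the actual `quorum.py` constants may differ; this is illustrative):

```python
# Quorum identifiers as in the system tests' quorum.py (values illustrative).
zk = "ZK"
remote_kraft = "REMOTE_KRAFT"

# Before: a mutable alias whose membership was expected to change as the
# project moved through and beyond the KRaft bridge release -- confusing.
all_non_upgrade = [zk, remote_kraft]

# After: a name that states exactly what it contains, forever.
zk_and_remote_kraft = [zk, remote_kraft]
```

The two lists are identical today; the point of the rename is that the second name never needs its meaning redefined, so a search-and-replace suffices if the set of quorums ever changes.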
[jira] [Created] (KAFKA-14057) Support dynamic reconfiguration in KRaft remote controllers
Ron Dagostino created KAFKA-14057: - Summary: Support dynamic reconfiguration in KRaft remote controllers Key: KAFKA-14057 URL: https://issues.apache.org/jira/browse/KAFKA-14057 Project: Kafka Issue Type: Task Reporter: Ron Dagostino We currently do not support dynamic reconfiguration of KRaft remote controllers. We only wire up brokers and react to metadata log changes there. We do no such wiring or reacting in a node where process.roles=controller. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14057) Support dynamic reconfiguration in KRaft remote controllers
[ https://issues.apache.org/jira/browse/KAFKA-14057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-14057: -- Description: We currently do not support dynamic reconfiguration of KRaft remote controllers. We only wire up brokers and react to metadata log changes there. We do no such wiring or reacting in a node where process.roles=controller. Related to https://issues.apache.org/jira/browse/KAFKA-14051. was: We currently do not support dynamic reconfiguration of KRaft remote controllers. We only wire up brokers and react to metadata log changes there. We do no such wiring or reacting in a node where process.roles=controller. > Support dynamic reconfiguration in KRaft remote controllers > --- > > Key: KAFKA-14057 > URL: https://issues.apache.org/jira/browse/KAFKA-14057 > Project: Kafka > Issue Type: Task >Reporter: Ron Dagostino >Priority: Major > > We currently do not support dynamic reconfiguration of KRaft remote > controllers. We only wire up brokers and react to metadata log changes > there. We do no such wiring or reacting in a node where > process.roles=controller. Related to > https://issues.apache.org/jira/browse/KAFKA-14051. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14056) Test reading of old messages formats in ZK-to-KRaft upgrade test
Ron Dagostino created KAFKA-14056: - Summary: Test reading of old messages formats in ZK-to-KRaft upgrade test Key: KAFKA-14056 URL: https://issues.apache.org/jira/browse/KAFKA-14056 Project: Kafka Issue Type: Task Components: kraft Reporter: Ron Dagostino Whenever we support ZK-to-KRaft upgrade we must confirm that we can still read messages with an older message format. We can no longer write such messages as of IBP 3.0 (which is the minimum supported with KRaft), but we must still support reading such messages with KRaft. Therefore, the only way to test this would be to write the messages with a non-KRaft cluster, upgrade to KRaft, and then confirm we can read those messages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-14051) KRaft remote controllers do not create metrics reporters
[ https://issues.apache.org/jira/browse/KAFKA-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino reassigned KAFKA-14051: - Assignee: Ron Dagostino > KRaft remote controllers do not create metrics reporters > > > Key: KAFKA-14051 > URL: https://issues.apache.org/jira/browse/KAFKA-14051 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 3.3 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14051) KRaft remote controllers do not create metrics reporters
Ron Dagostino created KAFKA-14051: - Summary: KRaft remote controllers do not create metrics reporters Key: KAFKA-14051 URL: https://issues.apache.org/jira/browse/KAFKA-14051 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 3.3 Reporter: Ron Dagostino KRaft remote controllers (KRaft nodes with the configuration value process.roles=controller) do not create the configured metrics reporters defined by the configuration key metric.reporters. The reason is because KRaft remote controllers are not wired up for dynamic config changes, and the creation of the configured metric reporters actually happens during the wiring up of the broker for dynamic reconfiguration, in the invocation of DynamicBrokerConfig.addReconfigurables(KafkaBroker). -- This message was sent by Atlassian Jira (v8.20.10#820010)
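The gap can be illustrated with a config sketch (node id, port, and quorum address below are invented for illustration): a controller-only node that lists a reporter which is never instantiated, because reporter creation only happens when a broker wires itself up for dynamic reconfiguration.

```properties
# Hypothetical controller-only configuration (ids/ports invented).
# The reporter below is silently never created on this node, because
# DynamicBrokerConfig.addReconfigurables() -- where configured reporters
# are instantiated -- is only invoked for brokers.
process.roles=controller
node.id=3000
controller.quorum.voters=3000@localhost:9093
listeners=CONTROLLER://localhost:9093
controller.listener.names=CONTROLLER
metric.reporters=org.apache.kafka.common.metrics.JmxReporter
```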
[jira] [Commented] (KAFKA-13582) `TestVerifiableProducer. test_multiple_kraft_security_protocols` consistently fails
[ https://issues.apache.org/jira/browse/KAFKA-13582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472008#comment-17472008 ] Ron Dagostino commented on KAFKA-13582: --- This is purely a test configuration issue. An example error message: Exception in thread "main" org.apache.kafka.common.config.ConfigException: Controller listener with name CONTROLLER_SASL_SSL defined in controller.listener.names not found in listener.security.protocol.map (an explicit security mapping for each controller listener is required if listener.security.protocol.map is non-empty, or if there are security protocols other than PLAINTEXT in use) > `TestVerifiableProducer. test_multiple_kraft_security_protocols` consistently > fails > --- > > Key: KAFKA-13582 > URL: https://issues.apache.org/jira/browse/KAFKA-13582 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: David Jacot >Priority: Blocker > > `TestVerifiableProducer. test_multiple_kraft_security_protocols` consistently > fails in 3.1 and trunk: > [http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2021-12-27--001.system-test-kafka-3.1--1640613507--confluentinc--3.1]–3527156ac3/report.html. > It seems that the system test does not comply with the changes made in > [https://github.com/apache/kafka/commit/36cc3dc2589ef279add3de59c6e7c4548e264eed.] > We need to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
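The error message points at the fix for the test configuration: give every controller listener an explicit entry in the security protocol map. A hypothetical fragment (listener names, protocols, and port are invented for illustration, not taken from the actual system test):

```properties
# Hypothetical fragment: CONTROLLER_SASL_SSL must be mapped explicitly,
# since the map is non-empty and non-PLAINTEXT protocols are in use.
controller.listener.names=CONTROLLER_SASL_SSL
listener.security.protocol.map=CONTROLLER_SASL_SSL:SASL_SSL,EXTERNAL:SASL_SSL
listeners=EXTERNAL://localhost:9092
```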
[jira] [Commented] (KAFKA-13502) Support configuring BROKER_LOGGER on controller-only KRaft nodes
[ https://issues.apache.org/jira/browse/KAFKA-13502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461478#comment-17461478 ] Ron Dagostino commented on KAFKA-13502: --- This is one aspect of the broader problem as described in https://issues.apache.org/jira/browse/KAFKA-13552 > Support configuring BROKER_LOGGER on controller-only KRaft nodes > > > Key: KAFKA-13502 > URL: https://issues.apache.org/jira/browse/KAFKA-13502 > Project: Kafka > Issue Type: Improvement >Reporter: Colin McCabe >Priority: Major > Labels: kip-500 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (KAFKA-13552) Unable to dynamically change broker log levels on KRaft
[ https://issues.apache.org/jira/browse/KAFKA-13552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461476#comment-17461476 ] Ron Dagostino commented on KAFKA-13552: --- [~dengziming] Thanks for pointing that out. Although there is not much in that ticket, it appears to address controller-only nodes, whereas this ticket indicates that no KRaft node (broker-only, controller-only, or combined broker+controller) supports dynamic changes to the log levels. I updated the description of this ticket to point to that one since it is just one aspect of the problem. > Unable to dynamically change broker log levels on KRaft > --- > > Key: KAFKA-13552 > URL: https://issues.apache.org/jira/browse/KAFKA-13552 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 3.1.0, 3.0.0 >Reporter: Ron Dagostino >Priority: Major > > It is currently not possible to dynamically change the log level in KRaft. > For example: > kafka-configs.sh --bootstrap-server --alter --add-config > "kafka.server.ReplicaManager=DEBUG" --entity-type broker-loggers > --entity-name 0 > Results in: > org.apache.kafka.common.errors.InvalidRequestException: Unexpected resource > type BROKER_LOGGER. > The code to process this request is in ZkAdminManager.alterLogLevelConfigs(). > This needs to be moved out of there, and the functionality has to be > processed locally on the broker instead of being forwarded to the KRaft > controller. > It is also an open question as to how we can dynamically alter log levels for > a remote KRaft controller. Connecting directly to it is one possible > solution, but that may not be desirable since generally connecting directly > to the controller is not necessary. The ticket for this particular aspect of > the issue is https://issues.apache.org/jira/browse/KAFKA-13502 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KAFKA-13552) Unable to dynamically change broker log levels on KRaft
[ https://issues.apache.org/jira/browse/KAFKA-13552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13552: -- Description: It is currently not possible to dynamically change the log level in KRaft. For example: kafka-configs.sh --bootstrap-server --alter --add-config "kafka.server.ReplicaManager=DEBUG" --entity-type broker-loggers --entity-name 0 Results in: org.apache.kafka.common.errors.InvalidRequestException: Unexpected resource type BROKER_LOGGER. The code to process this request is in ZkAdminManager.alterLogLevelConfigs(). This needs to be moved out of there, and the functionality has to be processed locally on the broker instead of being forwarded to the KRaft controller. It is also an open question as to how we can dynamically alter log levels for a remote KRaft controller. Connecting directly to it is one possible solution, but that may not be desirable since generally connecting directly to the controller is not necessary. The ticket for this particular aspect of the issue is https://issues.apache.org/jira/browse/KAFKA-13502 was: It is currently not possible to dynamically change the log level in KRaft. For example: kafka-configs.sh --bootstrap-server --alter --add-config "kafka.server.ReplicaManager=DEBUG" --entity-type broker-loggers --entity-name 0 Results in: org.apache.kafka.common.errors.InvalidRequestException: Unexpected resource type BROKER_LOGGER. The code to process this request is in ZkAdminManager.alterLogLevelConfigs(). This needs to be moved out of there, and the functionality has to be processed locally on the broker instead of being forwarded to the KRaft controller. It is also an open question as to how we can dynamically alter log levels for a remote KRaft controller. Connecting directly to it is one possible solution, but that may not be desirable since generally connecting directly to the controller is not necessary. 
> Unable to dynamically change broker log levels on KRaft > --- > > Key: KAFKA-13552 > URL: https://issues.apache.org/jira/browse/KAFKA-13552 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 3.1.0, 3.0.0 >Reporter: Ron Dagostino >Priority: Major > > It is currently not possible to dynamically change the log level in KRaft. > For example: > kafka-configs.sh --bootstrap-server --alter --add-config > "kafka.server.ReplicaManager=DEBUG" --entity-type broker-loggers > --entity-name 0 > Results in: > org.apache.kafka.common.errors.InvalidRequestException: Unexpected resource > type BROKER_LOGGER. > The code to process this request is in ZkAdminManager.alterLogLevelConfigs(). > This needs to be moved out of there, and the functionality has to be > processed locally on the broker instead of being forwarded to the KRaft > controller. > It is also an open question as to how we can dynamically alter log levels for > a remote KRaft controller. Connecting directly to it is one possible > solution, but that may not be desirable since generally connecting directly > to the controller is not necessary. The ticket for this particular aspect of > the issue is https://issues.apache.org/jira/browse/KAFKA-13502 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (KAFKA-13552) Unable to dynamically change broker log levels on KRaft
Ron Dagostino created KAFKA-13552: - Summary: Unable to dynamically change broker log levels on KRaft Key: KAFKA-13552 URL: https://issues.apache.org/jira/browse/KAFKA-13552 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 3.0.0, 3.1.0 Reporter: Ron Dagostino It is currently not possible to dynamically change the log level in KRaft. For example: kafka-configs.sh --bootstrap-server --alter --add-config "kafka.server.ReplicaManager=DEBUG" --entity-type broker-loggers --entity-name 0 Results in: org.apache.kafka.common.errors.InvalidRequestException: Unexpected resource type BROKER_LOGGER. The code to process this request is in ZkAdminManager.alterLogLevelConfigs(). This needs to be moved out of there, and the functionality has to be processed locally on the broker instead of being forwarded to the KRaft controller. It is also an open question as to how we can dynamically alter log levels for a remote KRaft controller. Connecting directly to it is one possible solution, but that may not be desirable since generally connecting directly to the controller is not necessary. -- This message was sent by Atlassian Jira (v8.20.1#820001)
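For reference, a hypothetical complete invocation of the command quoted above (the bootstrap address localhost:9092 and broker id 0 are invented for illustration); on a ZooKeeper-based cluster this succeeds, while on KRaft it currently fails with the InvalidRequestException described:

```shell
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --add-config "kafka.server.ReplicaManager=DEBUG" \
  --entity-type broker-loggers --entity-name 0
```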
[jira] [Updated] (KAFKA-13456) Tighten KRaft config checks/constraints
[ https://issues.apache.org/jira/browse/KAFKA-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13456: -- Priority: Blocker (was: Major) > Tighten KRaft config checks/constraints > --- > > Key: KAFKA-13456 > URL: https://issues.apache.org/jira/browse/KAFKA-13456 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 2.8.0, 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Blocker > Fix For: 3.1.0 > > > We need to tighten the configuration constraints/checks related to KRaft > configs because the current checks do not eliminate illegal configuration > combinations. Specifically, we need to add the following constraints: > * controller.listener.names is required to be empty for the non-KRaft (i.e. > ZooKeeper) case. A ZooKeeper-based cluster that sets this config will fail to > restart until this config is removed. This generally should not be occurring > -- nobody should be setting KRaft-specific configs in a ZooKeeper-based > cluster -- but we currently do not prevent it from happening. > * There must be no advertised listeners when running just a KRaft controller > (i.e. when process.roles=controller). This means neither listeners nor > advertised.listeners (if the latter is explicitly defined) can contain a > listener that does not also appear in controller.listener.names. > * When running a KRaft broker (i.e. when process.roles=broker or > process.roles=broker,controller), advertised listeners must not include any > listeners appearing in controller.listener.names. > * When running a KRaft controller (i.e. when process.roles=controller or > process.roles=broker,controller) controller.listener.names must be non-empty > and every one must appear in listeners > * When running just a KRaft broker (i.e. when process.roles=broker) > controller.listener.names must be non-empty and none of them can appear in > listeners. 
This is currently checked indirectly, but the indirect checks do > not catch all cases. We will check directly. > * When running just a KRaft broker we log a warning if more than one entry > appears in controller.listener.names because only the first entry is used. > In addition to the above additional constraints, we should also map the > CONTROLLER listener name to the PLAINTEXT security protocol by default when > using KRaft -- this would be a very helpful convenience. -- This message was sent by Atlassian Jira (v8.20.1#820001)
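A hypothetical combined-mode configuration that satisfies all of the constraints listed above (node id, ports, and quorum address invented for illustration):

```properties
# process.roles includes controller, so controller.listener.names must be
# non-empty and every entry must also appear in listeners.
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
# Advertised listeners must NOT include any controller listener.
advertised.listeners=PLAINTEXT://localhost:9092
# Proposed convenience: CONTROLLER would map to PLAINTEXT by default,
# making this line unnecessary in the common case.
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
```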
[jira] [Updated] (KAFKA-13456) Tighten KRaft config checks/constraints
[ https://issues.apache.org/jira/browse/KAFKA-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13456: -- Description: We need to tighten the configuration constraints/checks related to KRaft configs because the current checks do not eliminate illegal configuration combinations. Specifically, we need to add the following constraints: * controller.listener.names is required to be empty for the non-KRaft (i.e. ZooKeeper) case. A ZooKeeper-based cluster that sets this config will fail to restart until this config is removed. This generally should not be occurring -- nobody should be setting KRaft-specific configs in a ZooKeeper-based cluster -- but we currently do not prevent it from happening. * There must be no advertised listeners when running just a KRaft controller (i.e. when process.roles=controller). This means neither listeners nor advertised.listeners (if the latter is explicitly defined) can contain a listener that does not also appear in controller.listener.names. * When running a KRaft broker (i.e. when process.roles=broker or process.roles=broker,controller), advertised listeners must not include any listeners appearing in controller.listener.names. * When running a KRaft controller (i.e. when process.roles=controller or process.roles=broker,controller) controller.listener.names must be non-empty and every one must appear in listeners * When running just a KRaft broker (i.e. when process.roles=broker) controller.listener.names must be non-empty and none of them can appear in listeners. This is currently checked indirectly, but the indirect checks do not catch all cases. We will check directly. * When running just a KRaft broker we log a warning if more than one entry appears in controller.listener.names because only the first entry is used. 
In addition to the above additional constraints, we should also map the CONTROLLER listener name to the PLAINTEXT security protocol by default when using KRaft -- this would be a very helpful convenience. was:The controller.listener.names config is currently checked for existence when the process.roles contains the controller role (i.e. process.roles=controller or process.roles=broker,contrtoller); it is not checked for existence when process.roles=broker. However, KRaft brokers have to talk to KRaft controllers, of course, and they do so by taking the first entry in the controller.listener.names list. Therefore, controller.listener.names is required in KRaft mode even when process.roles=broker. > Tighten KRaft config checks/constraints > --- > > Key: KAFKA-13456 > URL: https://issues.apache.org/jira/browse/KAFKA-13456 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 2.8.0, 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > > We need to tighten the configuration constraints/checks related to KRaft > configs because the current checks do not eliminate illegal configuration > combinations. Specifically, we need to add the following constraints: > * controller.listener.names is required to be empty for the non-KRaft (i.e. > ZooKeeper) case. A ZooKeeper-based cluster that sets this config will fail to > restart until this config is removed. This generally should not be occurring > -- nobody should be setting KRaft-specific configs in a ZooKeeper-based > cluster -- but we currently do not prevent it from happening. > * There must be no advertised listeners when running just a KRaft controller > (i.e. when process.roles=controller). This means neither listeners nor > advertised.listeners (if the latter is explicitly defined) can contain a > listener that does not also appear in controller.listener.names. > * When running a KRaft broker (i.e. 
when process.roles=broker or > process.roles=broker,controller), advertised listeners must not include any > listeners appearing in controller.listener.names. > * When running a KRaft controller (i.e. when process.roles=controller or > process.roles=broker,controller) controller.listener.names must be non-empty > and every one must appear in listeners > * When running just a KRaft broker (i.e. when process.roles=broker) > controller.listener.names must be non-empty and none of them can appear in > listeners. This is currently checked indirectly, but the indirect checks do > not catch all cases. We will check directly. > * When running just a KRaft broker we log a warning if more than one entry > appears in controller.listener.names because only the first entry is used. > In addition to the above additional constraints, we should also map the > CONTROLLER listener name to the PLAINTEXT security protocol by default when > using KRaft -- this would be a very helpful convenience. -- This m
[jira] [Updated] (KAFKA-13456) Tighten KRaft config checks/constraints
[ https://issues.apache.org/jira/browse/KAFKA-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13456: -- Component/s: kraft > Tighten KRaft config checks/constraints > --- > > Key: KAFKA-13456 > URL: https://issues.apache.org/jira/browse/KAFKA-13456 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 2.8.0, 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > > The controller.listener.names config is currently checked for existence when > the process.roles contains the controller role (i.e. process.roles=controller > or process.roles=broker,controller); it is not checked for existence when > process.roles=broker. However, KRaft brokers have to talk to KRaft > controllers, of course, and they do so by taking the first entry in the > controller.listener.names list. Therefore, controller.listener.names is > required in KRaft mode even when process.roles=broker. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KAFKA-13456) Tighten KRaft config checks/constraints
[ https://issues.apache.org/jira/browse/KAFKA-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13456: -- Fix Version/s: 3.1.0 > Tighten KRaft config checks/constraints > --- > > Key: KAFKA-13456 > URL: https://issues.apache.org/jira/browse/KAFKA-13456 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > Fix For: 3.1.0 > > > The controller.listener.names config is currently checked for existence when > the process.roles contains the controller role (i.e. process.roles=controller > or process.roles=broker,controller); it is not checked for existence when > process.roles=broker. However, KRaft brokers have to talk to KRaft > controllers, of course, and they do so by taking the first entry in the > controller.listener.names list. Therefore, controller.listener.names is > required in KRaft mode even when process.roles=broker. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KAFKA-13456) Tighten KRaft config checks/constraints
[ https://issues.apache.org/jira/browse/KAFKA-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13456: -- Summary: Tighten KRaft config checks/constraints (was: controller.listener.names is required for all KRaft nodes, not just controllers) > Tighten KRaft config checks/constraints > --- > > Key: KAFKA-13456 > URL: https://issues.apache.org/jira/browse/KAFKA-13456 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.8.0, 3.1.0, 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > > The controller.listener.names config is currently checked for existence when > the process.roles contains the controller role (i.e. process.roles=controller > or process.roles=broker,controller); it is not checked for existence when > process.roles=broker. However, KRaft brokers have to talk to KRaft > controllers, of course, and they do so by taking the first entry in the > controller.listener.names list. Therefore, controller.listener.names is > required in KRaft mode even when process.roles=broker. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KAFKA-13456) Tighten KRaft config checks/constraints
[ https://issues.apache.org/jira/browse/KAFKA-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13456: -- Affects Version/s: (was: 3.1.0) > Tighten KRaft config checks/constraints > --- > > Key: KAFKA-13456 > URL: https://issues.apache.org/jira/browse/KAFKA-13456 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > > The controller.listener.names config is currently checked for existence when > the process.roles contains the controller role (i.e. process.roles=controller > or process.roles=broker,controller); it is not checked for existence when > process.roles=broker. However, KRaft brokers have to talk to KRaft > controllers, of course, and they do so by taking the first entry in the > controller.listener.names list. Therefore, controller.listener.names is > required in KRaft mode even when process.roles=broker. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KAFKA-13456) Tighten KRaft config checks/constraints
[ https://issues.apache.org/jira/browse/KAFKA-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13456: -- Fix Version/s: (was: 3.1.0) > Tighten KRaft config checks/constraints > --- > > Key: KAFKA-13456 > URL: https://issues.apache.org/jira/browse/KAFKA-13456 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > > The controller.listener.names config is currently checked for existence when > the process.roles contains the controller role (i.e. process.roles=controller > or process.roles=broker,controller); it is not checked for existence when > process.roles=broker. However, KRaft brokers have to talk to KRaft > controllers, of course, and they do so by taking the first entry in the > controller.listener.names list. Therefore, controller.listener.names is > required in KRaft mode even when process.roles=broker. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KAFKA-13456) controller.listener.names is required for all KRaft nodes, not just controllers
[ https://issues.apache.org/jira/browse/KAFKA-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13456: -- Description: The controller.listener.names config is currently checked for existence when the process.roles contains the controller role (i.e. process.roles=controller or process.roles=broker,controller); it is not checked for existence when process.roles=broker. However, KRaft brokers have to talk to KRaft controllers, of course, and they do so by taking the first entry in the controller.listener.names list. Therefore, controller.listener.names is required in KRaft mode even when process.roles=broker. (was: The controller.listener.names config is currently checked for existence when the process.roles contains the controller role (i.e. process.roles=controller or process.roles=broker,controller); it is not checked for existence when process.roles=broker. However, KRaft brokers have to talk to KRaft controllers, of course, and they do so by taking the first entry in the controller.listener.names list. Therefore, controller.listener.names is required in KRaft mode even when process.roles.broker.) > controller.listener.names is required for all KRaft nodes, not just > controllers > --- > > Key: KAFKA-13456 > URL: https://issues.apache.org/jira/browse/KAFKA-13456 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.8.0, 3.1.0, 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > > The controller.listener.names config is currently checked for existence when > the process.roles contains the controller role (i.e. process.roles=controller > or process.roles=broker,controller); it is not checked for existence when > process.roles=broker. However, KRaft brokers have to talk to KRaft > controllers, of course, and they do so by taking the first entry in the > controller.listener.names list. Therefore, controller.listener.names is > required in KRaft mode even when process.roles=broker. 
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (KAFKA-13456) controller.listener.names is required for all KRaft nodes, not just controllers
Ron Dagostino created KAFKA-13456: - Summary: controller.listener.names is required for all KRaft nodes, not just controllers Key: KAFKA-13456 URL: https://issues.apache.org/jira/browse/KAFKA-13456 Project: Kafka Issue Type: Bug Affects Versions: 3.0.0, 2.8.0, 3.1.0 Reporter: Ron Dagostino Assignee: Ron Dagostino The controller.listener.names config is currently checked for existence when the process.roles contains the controller role (i.e. process.roles=controller or process.roles=broker,controller); it is not checked for existence when process.roles=broker. However, KRaft brokers have to talk to KRaft controllers, of course, and they do so by taking the first entry in the controller.listener.names list. Therefore, controller.listener.names is required in KRaft mode even when process.roles.broker. -- This message was sent by Atlassian Jira (v8.20.1#820001)
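A hypothetical broker-only configuration showing the point of this ticket (node id, ports, and quorum address invented for illustration): controller.listener.names is still required even though this node runs no controller, because the broker takes the first entry in that list to decide how to contact the quorum.

```properties
# Hypothetical broker-only node (ids/ports invented).
process.roles=broker
node.id=2
controller.quorum.voters=3000@localhost:9093
# Required even with process.roles=broker: the first entry names the
# listener used to reach the controllers. It must NOT appear in the
# broker's own listeners list.
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://localhost:9092
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
```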
[jira] [Comment Edited] (KAFKA-13140) KRaft brokers do not expose kafka.controller metrics, breaking backwards compatibility
[ https://issues.apache.org/jira/browse/KAFKA-13140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443850#comment-17443850 ] Ron Dagostino edited comment on KAFKA-13140 at 11/15/21, 2:25 PM: -- No longer applies due to the adoption of [KIP-771: KRaft brokers without the "controller" role should not expose controller metrics|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188743985] was (Author: rndgstn): No longer applies due to the adoption of [KIP-7761: KRaft brokers without the "controller" role should not expose controller metrics|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188743985] > KRaft brokers do not expose kafka.controller metrics, breaking backwards > compatibility > -- > > Key: KAFKA-13140 > URL: https://issues.apache.org/jira/browse/KAFKA-13140 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 2.8.0, 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > Fix For: 3.2.0 > > > The following controller metrics are exposed on every broker in a > ZooKeeper-based (i.e. non-KRaft) cluster regardless of whether the broker is > the active controller or not, but these metrics are not exposed on KRaft > nodes that have process.roles=broker (i.e. KRaft nodes that do not implement > the controller role). For backwards compatibility, KRaft nodes that are just > brokers should expose these metrics with values all equal to 0: just like > ZooKeeper-based brokers do when they are not the active controller. > kafka.controller:type=KafkaController,name=ActiveControllerCount > kafka.controller:type=KafkaController,name=GlobalTopicCount > kafka.controller:type=KafkaController,name=GlobalPartitionCount > kafka.controller:type=KafkaController,name=OfflinePartitionsCount > kafka.controller:type=KafkaController,name=PreferredReplicaImbalanceCount -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KAFKA-13270) Kafka may fail to connect to ZooKeeper, retry forever, and never start
[ https://issues.apache.org/jira/browse/KAFKA-13270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13270: -- Description: The implementation of https://issues.apache.org/jira/browse/ZOOKEEPER-3593 in ZooKeeper version 3.6.0 decreased the default value for the ZooKeeper client's `jute.maxbuffer` configuration from 4MB to 1MB. This can cause a problem if Kafka tries to retrieve a large amount of data across many znodes -- in such a case the ZooKeeper client will repeatedly emit a message of the form "java.io.IOException: Packet len <> is out of range" and the Kafka broker will never connect to ZooKeeper and fail to make progress on the startup sequence. We can avoid the potential for this issue to occur by explicitly setting the value to 4MB whenever we create a new ZooKeeper client as long as no explicit value has been set via the `jute.maxbuffer` system property. (was: The implementation of https://issues.apache.org/jira/browse/ZOOKEEPER-3593 in ZooKeeper version 3.6.0 decreased the default value for the ZooKeeper client's `jute.maxbuffer` configuration from 4MB to 1MB. This can cause a problem if Kafka tries to retrieve a large amount of data across many znodes -- in such a case the ZooKeeper client will repeatedly emit a message of the form "java.io.IOException: Packet len <> is out of range" and the Kafka broker will never connect to ZooKeeper and fail make progress on the startup sequence. We can avoid the potential for this issue to occur by explicitly setting the value to 4MB whenever we create a new ZooKeeper client as long as no explicit value has been set via the `jute.maxbuffer` system property.) 
> Kafka may fail to connect to ZooKeeper, retry forever, and never start > -- > > Key: KAFKA-13270 > URL: https://issues.apache.org/jira/browse/KAFKA-13270 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Blocker > Fix For: 3.0.0 > > > The implementation of https://issues.apache.org/jira/browse/ZOOKEEPER-3593 in > ZooKeeper version 3.6.0 decreased the default value for the ZooKeeper > client's `jute.maxbuffer` configuration from 4MB to 1MB. This can cause a > problem if Kafka tries to retrieve a large amount of data across many znodes > -- in such a case the ZooKeeper client will repeatedly emit a message of the > form "java.io.IOException: Packet len <> is out of range" and the Kafka > broker will never connect to ZooKeeper and fail to make progress on the > startup sequence. We can avoid the potential for this issue to occur by > explicitly setting the value to 4MB whenever we create a new ZooKeeper client > as long as no explicit value has been set via the `jute.maxbuffer` system > property. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13270) Kafka may fail to connect to ZooKeeper, retry forever, and never start
Ron Dagostino created KAFKA-13270: - Summary: Kafka may fail to connect to ZooKeeper, retry forever, and never start Key: KAFKA-13270 URL: https://issues.apache.org/jira/browse/KAFKA-13270 Project: Kafka Issue Type: Bug Affects Versions: 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.0.0 The implementation of https://issues.apache.org/jira/browse/ZOOKEEPER-3593 in ZooKeeper version 3.6.0 decreased the default value for the ZooKeeper client's `jute.maxbuffer` configuration from 4MB to 1MB. This can cause a problem if Kafka tries to retrieve a large amount of data across many znodes -- in such a case the ZooKeeper client will repeatedly emit a message of the form "java.io.IOException: Packet len <> is out of range" and the Kafka broker will never connect to ZooKeeper and fail make progress on the startup sequence. We can avoid the potential for this issue to occur by explicitly setting the value to 4MB whenever we create a new ZooKeeper client as long as no explicit value has been set via the `jute.maxbuffer` system property. -- This message was sent by Atlassian Jira (v8.3.4#803005)
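Until a release containing the fix is deployed, one way to apply the same remedy manually is to set the system property on the broker JVM yourself (4194304 bytes = 4MB, the pre-ZooKeeper-3.6.0 default). KAFKA_OPTS is read by the standard start scripts; the start command below assumes a stock Kafka layout:

```shell
# Restore the 4MB ZooKeeper client packet limit for the broker JVM.
# jute.maxbuffer is specified in bytes.
export KAFKA_OPTS="-Djute.maxbuffer=4194304"
# bin/kafka-server-start.sh config/server.properties
```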
[jira] [Created] (KAFKA-13224) broker.id does not appear in config's originals map when setting just node.id
Ron Dagostino created KAFKA-13224: - Summary: broker.id does not appear in config's originals map when setting just node.id Key: KAFKA-13224 URL: https://issues.apache.org/jira/browse/KAFKA-13224 Project: Kafka Issue Type: Bug Affects Versions: 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Plugins may expect broker.id to exist as a key in the config's various originals()-related maps, but with KRaft we rely solely on node.id for the broker's ID, and with the Zk-based brokers we provide the option to specify node.id in addition to (or as a full replacement for) broker.id. There are multiple problems related to this switch to node.id: # We do not enforce consistency between explicitly-specified broker.id and node.id properties in the config -- it is entirely possible right now that we could set broker.id=0 and also set node.id=1, and the broker will use 1 for its ID. This is confusing at best; the broker should detect this inconsistency and fail to start with a ConfigException. # When node.id is set, both that value and any explicitly-set broker.id value will exist in the config's *originals()-related maps*. Downstream components are often configured based on these maps, and they may ask for the broker.id, so downstream components may be misconfigured if the values differ, or they may fail during configuration if no broker.id key exists in the map at all. # The config's *values()-related maps* will contain either the explicitly-specified broker.id value or the default value of -1. When node.id is set, both that value (which cannot be negative) and the (potentially -1) broker.id value will exist in the config's values()-related maps. Downstream components are often configured based on these maps, and they may ask for the broker.id, so downstream components may be misconfigured if the broker.id value differs from the broker's true ID. The broker should detect inconsistency between explicitly-specified broker.id and node.id values and fail startup accordingly. 
It should also ensure that the config's originals()- and values()-related maps contain the same mapped values for both broker.id and node.id keys. -- This message was sent by Atlassian Jira (v8.3.4#803005)
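The consistency check proposed above can be sketched outside the broker as well; a hypothetical pre-flight script (the property names mirror a standard server.properties, the temp file stands in for the real config):

```shell
# Sketch of the proposed validation: if broker.id and node.id are both set
# and disagree, report the inconsistency instead of silently preferring
# node.id the way the broker currently does.
props=$(mktemp)
printf 'broker.id=0\nnode.id=1\n' > "$props"

broker_id=$(grep '^broker.id=' "$props" | cut -d= -f2)
node_id=$(grep '^node.id=' "$props" | cut -d= -f2)

# Only flag a problem when both properties are present and differ.
if [ -n "$broker_id" ] && [ -n "$node_id" ] && [ "$broker_id" != "$node_id" ]; then
  echo "INCONSISTENT: broker.id=$broker_id node.id=$node_id"
fi
```

The real fix belongs in KafkaConfig itself (throwing a ConfigException at startup); this is only a stand-in for the rule being described.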
[jira] [Updated] (KAFKA-13219) BrokerState metric not working for KRaft clusters
[ https://issues.apache.org/jira/browse/KAFKA-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13219: -- Priority: Blocker (was: Major) > BrokerState metric not working for KRaft clusters > - > > Key: KAFKA-13219 > URL: https://issues.apache.org/jira/browse/KAFKA-13219 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Blocker > Fix For: 3.0.0 > > > The BrokerState metric always has a value of 0, for NOT_RUNNING, in KRaft > clusters -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13219) BrokerState metric not working for KRaft clusters
[ https://issues.apache.org/jira/browse/KAFKA-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13219: -- Fix Version/s: 3.0.0 > BrokerState metric not working for KRaft clusters > - > > Key: KAFKA-13219 > URL: https://issues.apache.org/jira/browse/KAFKA-13219 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > Fix For: 3.0.0 > > > The BrokerState metric always has a value of 0, for NOT_RUNNING, in KRaft > clusters -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13219) BrokerState metric not working for KRaft clusters
Ron Dagostino created KAFKA-13219: - Summary: BrokerState metric not working for KRaft clusters Key: KAFKA-13219 URL: https://issues.apache.org/jira/browse/KAFKA-13219 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino The BrokerState metric always has a value of 0, for NOT_RUNNING, in KRaft clusters -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-13192) broker.id and node.id can be specified inconsistently
[ https://issues.apache.org/jira/browse/KAFKA-13192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino reassigned KAFKA-13192: - Assignee: Ron Dagostino > broker.id and node.id can be specified inconsistently > - > > Key: KAFKA-13192 > URL: https://issues.apache.org/jira/browse/KAFKA-13192 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > > If both broker.id and node.id are set, and they are set inconsistently > (e.g. broker.id=0, node.id=1) then the value of node.id is used and the > broker.id value is left at the original value. The server should detect this > inconsistency, throw a ConfigException, and fail to start. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13192) broker.id and node.id can be specified inconsistently
Ron Dagostino created KAFKA-13192: - Summary: broker.id and node.id can be specified inconsistently Key: KAFKA-13192 URL: https://issues.apache.org/jira/browse/KAFKA-13192 Project: Kafka Issue Type: Bug Affects Versions: 3.0.0 Reporter: Ron Dagostino If both broker.id and node.id are set, and they are set inconsistently (e.g. broker.id=0, node.id=1) then the value of node.id is used and the broker.id value is left at the original value. The server should detect this inconsistency, throw a ConfigException, and fail to start. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13140) KRaft brokers do not expose kafka.controller metrics, breaking backwards compatibility
Ron Dagostino created KAFKA-13140: - Summary: KRaft brokers do not expose kafka.controller metrics, breaking backwards compatibility Key: KAFKA-13140 URL: https://issues.apache.org/jira/browse/KAFKA-13140 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 2.8.0, 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.1.0 The following controller metrics are exposed on every broker in a ZooKeeper-based (i.e. non-KRaft) cluster regardless of whether the broker is the active controller or not, but these metrics are not exposed on KRaft nodes that have process.roles=broker (i.e. KRaft nodes that do not implement the controller role). For backwards compatibility, KRaft nodes that are just brokers should expose these metrics with values all equal to 0, just like ZooKeeper-based brokers do when they are not the active controller. kafka.controller:type=KafkaController,name=ActiveControllerCount kafka.controller:type=KafkaController,name=GlobalTopicCount kafka.controller:type=KafkaController,name=GlobalPartitionCount kafka.controller:type=KafkaController,name=OfflinePartitionsCount kafka.controller:type=KafkaController,name=PreferredReplicaImbalanceCount -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13137) KRaft Controller Metric MBean names are incorrectly quoted
Ron Dagostino created KAFKA-13137: - Summary: KRaft Controller Metric MBean names are incorrectly quoted Key: KAFKA-13137 URL: https://issues.apache.org/jira/browse/KAFKA-13137 Project: Kafka Issue Type: Bug Components: controller Affects Versions: 2.8.0, 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.0.0 QuorumControllerMetrics is letting com.yammer.metrics.MetricName create the MBean names for all of the controller metrics, and that adds quotes. We have typically used KafkaMetricsGroup to explicitly create the MBean name, and we do not add quotes there. The controller metric names that are in common between the old and new controller must remain the same, but they are not. For example, this non-KRaft MBean name: kafka.controller:type=KafkaController,name=OfflinePartitionsCount has morphed into this when using KRaft: "kafka.controller":type="KafkaController",name="OfflinePartitionsCount" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-13069) Add magic number to DefaultKafkaPrincipalBuilder.KafkaPrincipalSerde
[ https://issues.apache.org/jira/browse/KAFKA-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-13069. --- Resolution: Invalid Flexible fields are sufficient as per KIP-590 VOTE email thread, so a magic number will not be needed. > Add magic number to DefaultKafkaPrincipalBuilder.KafkaPrincipalSerde > > > Key: KAFKA-13069 > URL: https://issues.apache.org/jira/browse/KAFKA-13069 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.0.0, 2.8.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Critical > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13069) Add magic number to DefaultKafkaPrincipalBuilder.KafkaPrincipalSerde
[ https://issues.apache.org/jira/browse/KAFKA-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380097#comment-17380097 ] Ron Dagostino commented on KAFKA-13069: --- See email sent to KIP-590 VOTE thread about this issue. https://lists.apache.org/thread.html/r740287705459d7156dfe1f62ed76433ad3e4639ddb91bb79297e8a70%40%3Cdev.kafka.apache.org%3E > Add magic number to DefaultKafkaPrincipalBuilder.KafkaPrincipalSerde > > > Key: KAFKA-13069 > URL: https://issues.apache.org/jira/browse/KAFKA-13069 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.0.0, 2.8.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Critical > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13069) Add magic number to DefaultKafkaPrincipalBuilder.KafkaPrincipalSerde
[ https://issues.apache.org/jira/browse/KAFKA-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-13069: -- Priority: Critical (was: Major) > Add magic number to DefaultKafkaPrincipalBuilder.KafkaPrincipalSerde > > > Key: KAFKA-13069 > URL: https://issues.apache.org/jira/browse/KAFKA-13069 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.0.0, 2.8.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Critical > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13069) Add magic number to DefaultKafkaPrincipalBuilder.KafkaPrincipalSerde
Ron Dagostino created KAFKA-13069: - Summary: Add magic number to DefaultKafkaPrincipalBuilder.KafkaPrincipalSerde Key: KAFKA-13069 URL: https://issues.apache.org/jira/browse/KAFKA-13069 Project: Kafka Issue Type: Bug Affects Versions: 2.8.0, 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12946) __consumer_offsets topic with very big partitions
[ https://issues.apache.org/jira/browse/KAFKA-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363805#comment-17363805 ] Ron Dagostino commented on KAFKA-12946: --- The only one I am familiar with and would recommend is the upgrade. > __consumer_offsets topic with very big partitions > - > > Key: KAFKA-12946 > URL: https://issues.apache.org/jira/browse/KAFKA-12946 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.0.0 >Reporter: Emi >Priority: Critical > > I am using Kafka 2.0.0 with java 8u191 > There is a partitions of the __consumer_offsets topic that is 600 GB with > 6000 segments older than 4 months. Other partitions of that topic are small: > 20-30 MB. > There are 60 consumer groups, 90 topics and 100 partitions per topic. > There aren't errors in the logs. From the log of the logcleaner, I can see > that partition is never touched from the logcleaner thread for the > compaction, but it only add new segments. > How is this possible? > There was another partition with the same problem, but after some months it > has been compacted. Now there is only one partition with this problem, but > this is bigger and keep growing > I have used the kafka-dump-log tool to check these old segments and I can see > many duplicates. So I would assume that is not compacted. > My settings: > {{offsets.commit.required.acks = -1}} > {{offsets.commit.timeout.ms = 5000}} > {{offsets.load.buffer.size = 5242880}} > {{offsets.retention.check.interval.ms = 60}} > {{offsets.retention.minutes = 10080}} > {{offsets.topic.compression.codec = 0}} > {{offsets.topic.num.partitions = 50}} > {{offsets.topic.replication.factor = 3}} > {{offsets.topic.segment.bytes = 104857600}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12946) __consumer_offsets topic with very big partitions
[ https://issues.apache.org/jira/browse/KAFKA-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363584#comment-17363584 ] Ron Dagostino commented on KAFKA-12946: --- Yeah, there are bugs. The KIP I referred to mentions one. There have also been several changes to make the log cleaner thread more robust to failure over time — even since the 2.0 version you are on. Upgrading might not help immediately, but you will want to leverage the KIP-664 tools at some point, so best to keep current. You should definitely read that KIP. > __consumer_offsets topic with very big partitions > - > > Key: KAFKA-12946 > URL: https://issues.apache.org/jira/browse/KAFKA-12946 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.0.0 >Reporter: Emi >Priority: Critical > > I am using Kafka 2.0.0 with java 8u191 > There is a partitions of the __consumer_offsets topic that is 600 GB with > 6000 segments older than 4 months. Other partitions of that topic are small: > 20-30 MB. > There are 60 consumer groups, 90 topics and 100 partitions per topic. > There aren't errors in the logs. From the log of the logcleaner, I can see > that partition is never touched from the logcleaner thread for the > compaction, but it only add new segments. > How is this possible? > There was another partition with the same problem, but after some months it > has been compacted. Now there is only one partition with this problem, but > this is bigger and keep growing > I have used the kafka-dump-log tool to check these old segments and I can see > many duplicates. So I would assume that is not compacted. 
> My settings: > {{offsets.commit.required.acks = -1}} > {{offsets.commit.timeout.ms = 5000}} > {{offsets.load.buffer.size = 5242880}} > {{offsets.retention.check.interval.ms = 60}} > {{offsets.retention.minutes = 10080}} > {{offsets.topic.compression.codec = 0}} > {{offsets.topic.num.partitions = 50}} > {{offsets.topic.replication.factor = 3}} > {{offsets.topic.segment.bytes = 104857600}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12946) __consumer_offsets topic with very big partitions
[ https://issues.apache.org/jira/browse/KAFKA-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363496#comment-17363496 ] Ron Dagostino commented on KAFKA-12946: --- I mean if you take a look at the size on disk, is the size of the log significantly smaller? Broker 0 might be the leader for partition 0 with 600 GB of size. Maybe broker 1 is a follower with about the same 600 GB size, but perhaps broker 2 is a follower with just 100 MB. It is unexplained why this would occur, but it is possible, and if so then you can make 2 the leader, move 1 to 3, move 3 back to 1, move 0 to 3, move 3 back to 0, and then make 0 the leader again -- now you have the same leadership and followers as before but 100 MB on all 3 replicas. > __consumer_offsets topic with very big partitions > - > > Key: KAFKA-12946 > URL: https://issues.apache.org/jira/browse/KAFKA-12946 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.0.0 >Reporter: Emi >Priority: Critical > > I am using Kafka 2.0.0 with java 8u191 > There is a partitions of the __consumer_offsets topic that is 600 GB with > 6000 segments older than 4 months. Other partitions of that topic are small: > 20-30 MB. > There are 60 consumer groups, 90 topics and 100 partitions per topic. > There aren't errors in the logs. From the log of the logcleaner, I can see > that partition is never touched from the logcleaner thread for the > compaction, but it only add new segments. > How is this possible? > There was another partition with the same problem, but after some months it > has been compacted. Now there is only one partition with this problem, but > this is bigger and keep growing > I have used the kafka-dump-log tool to check these old segments and I can see > many duplicates. So I would assume that is not compacted. 
> My settings: > {{offsets.commit.required.acks = -1}} > {{offsets.commit.timeout.ms = 5000}} > {{offsets.load.buffer.size = 5242880}} > {{offsets.retention.check.interval.ms = 60}} > {{offsets.retention.minutes = 10080}} > {{offsets.topic.compression.codec = 0}} > {{offsets.topic.num.partitions = 50}} > {{offsets.topic.replication.factor = 3}} > {{offsets.topic.segment.bytes = 104857600}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
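The replica shuffle described in the comment above is driven with the partition reassignment tool; a hypothetical sketch of the first step (broker ids 0-3 and partition 0 are purely illustrative, and a Kafka 2.0 cluster uses the --zookeeper form shown here, while newer releases accept --bootstrap-server):

```shell
# Step 1 of the shuffle: put broker 2 (the small replica) first in the
# replica list so it can become leader, and replace the bloated follower on
# broker 1 with broker 3, which re-copies the now-small log from scratch.
cat > /tmp/move.json <<'EOF'
{"version":1,"partitions":[
  {"topic":"__consumer_offsets","partition":0,"replicas":[2,0,3]}
]}
EOF
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file /tmp/move.json --execute
# Subsequent steps repeat with further JSON files to walk each remaining
# replica off and back, then restore the original replica order and run a
# preferred-leader election.
```

Note this only works when the cluster has more brokers than the replication factor, as the comment says, since each hop needs a spare broker to copy to.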
[jira] [Commented] (KAFKA-12946) __consumer_offsets topic with very big partitions
[ https://issues.apache.org/jira/browse/KAFKA-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363155#comment-17363155 ] Ron Dagostino commented on KAFKA-12946: --- If the partition isn't being cleaned then you can try setting min.cleanable.dirty.ratio=0 for the __consumer_offsets topic; this might allow it to get cleaned. You can delete that config after a while to let the value default back. Another possibility might exist if one of the follower replicas has a significantly smaller size than the leader; in such cases you can move leadership to the smaller replica and then reassign the follower replicas to new brokers so that they will copy the (much smaller-sized) data; then you can migrate the followers back to where they were originally and move the leader back to the original leader. This solution will only work if you have more brokers than the replication factor. Finally, take a look at https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions. You may not have any other options right now if it is a hanging transaction, but help is coming. > __consumer_offsets topic with very big partitions > - > > Key: KAFKA-12946 > URL: https://issues.apache.org/jira/browse/KAFKA-12946 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.0.0 >Reporter: Emi >Priority: Critical > > I am using Kafka 2.0.0 with java 8u191 > There is a partitions of the __consumer_offsets topic that is 600 GB with > 6000 segments older than 4 months. Other partitions of that topic are small: > 20-30 MB. > There are 60 consumer groups, 90 topics and 100 partitions per topic. > There aren't errors in the logs. From the log of the logcleaner, I can see > that partition is never touched from the logcleaner thread for the > compaction, but it only add new segments. > How is this possible? 
> There was another partition with the same problem, but after some months it > has been compacted. Now there is only one partition with this problem, but > this is bigger and keep growing > I have used the kafka-dump-log tool to check these old segments and I can see > many duplicates. So I would assume that is not compacted. > My settings: > {{offsets.commit.required.acks = -1}} > {{offsets.commit.timeout.ms = 5000}} > {{offsets.load.buffer.size = 5242880}} > {{offsets.retention.check.interval.ms = 60}} > {{offsets.retention.minutes = 10080}} > {{offsets.topic.compression.codec = 0}} > {{offsets.topic.num.partitions = 50}} > {{offsets.topic.replication.factor = 3}} > {{offsets.topic.segment.bytes = 104857600}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
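The per-topic override suggested in the comment above would look roughly like this (a sketch; Kafka 2.0 uses the --zookeeper form of kafka-configs.sh, while newer releases accept --bootstrap-server):

```shell
# Force the log cleaner to consider __consumer_offsets cleanable regardless
# of how little dirty (uncleaned) data each partition has accumulated.
bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type topics --entity-name __consumer_offsets \
  --add-config min.cleanable.dirty.ratio=0

# After the oversized partition has been compacted, remove the override so
# the topic falls back to the cluster default.
bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type topics --entity-name __consumer_offsets \
  --delete-config min.cleanable.dirty.ratio
```

As the comment notes, this helps only if the cleaner is skipping the partition because of the dirty ratio; it will not clear a partition pinned open by a hanging transaction.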
[jira] [Updated] (KAFKA-12897) KRaft Controller cannot create topic with multiple partitions on a single broker cluster
[ https://issues.apache.org/jira/browse/KAFKA-12897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-12897: -- Description: https://github.com/apache/kafka/pull/10494 introduced a bug in the KRaft controller where the controller will loop forever in `StripedReplicaPlacer` trying to identify the racks on which to place partition replicas if there is a single unfenced broker in the cluster and the number of requested partitions in a CREATE_TOPICS request is greater than 1. (was: https://github.com/apache/kafka/pull/10494 introduced a bug in the KRaft controller where the controller will loop forever in `StripedReplicaPlacer` trying to identify the racks on which to place partition replicas if the number of requested replicas (i.e. replication factor) in a CREATE_TOPICS request exceeds the number of effective racks ("effective" meaning a single rack if none are specified).) > KRaft Controller cannot create topic with multiple partitions on a single > broker cluster > > > Key: KAFKA-12897 > URL: https://issues.apache.org/jira/browse/KAFKA-12897 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Critical > Fix For: 3.0.0 > > > https://github.com/apache/kafka/pull/10494 introduced a bug in the KRaft > controller where the controller will loop forever in `StripedReplicaPlacer` > trying to identify the racks on which to place partition replicas if there is > a single unfenced broker in the cluster and the number of requested > partitions in a CREATE_TOPICS request is greater than 1. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12897) KRaft Controller cannot create topic with multiple partitions on a single broker cluster
[ https://issues.apache.org/jira/browse/KAFKA-12897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-12897: -- Summary: KRaft Controller cannot create topic with multiple partitions on a single broker cluster (was: KRaft Controller cannot create topic with replication factor greater than number of racks) > KRaft Controller cannot create topic with multiple partitions on a single > broker cluster > > > Key: KAFKA-12897 > URL: https://issues.apache.org/jira/browse/KAFKA-12897 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Critical > Fix For: 3.0.0 > > > https://github.com/apache/kafka/pull/10494 introduced a bug in the KRaft > controller where the controller will loop forever in `StripedReplicaPlacer` > trying to identify the racks on which to place partition replicas if the > number of requested replicas (i.e. replication factor) in a CREATE_TOPICS > request exceeds the number of effective racks ("effective" meaning a single > rack if none are specified). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12897) KRaft Controller cannot create topic with replication factor greater than number of racks
[ https://issues.apache.org/jira/browse/KAFKA-12897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-12897: -- Description: https://github.com/apache/kafka/pull/10494 introduced a bug in the KRaft controller where the controller will loop forever in `StripedReplicaPlacer` trying to identify the racks on which to place partition replicas if the number of requested replicas (i.e. replication factor) in a CREATE_TOPICS request exceeds the number of effective racks ("effective" meaning a single rack if none are specified). (was: https://github.com/apache/kafka/pull/10494 introduced a bug in the KRaft controller where the controller will loop forever in `StripedReplicaPlacer` trying to identify the racks on which to place partitions if the number of requested replicas (i.e. replication factor) in a CREATE_TOPICS request exceeds the number of effective racks ("effective" meaning a single rack if none are specified).) > KRaft Controller cannot create topic with replication factor greater than > number of racks > - > > Key: KAFKA-12897 > URL: https://issues.apache.org/jira/browse/KAFKA-12897 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 3.0.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Critical > Fix For: 3.0.0 > > > https://github.com/apache/kafka/pull/10494 introduced a bug in the KRaft > controller where the controller will loop forever in `StripedReplicaPlacer` > trying to identify the racks on which to place partition replicas if the > number of requested replicas (i.e. replication factor) in a CREATE_TOPICS > request exceeds the number of effective racks ("effective" meaning a single > rack if none are specified). -- This message was sent by Atlassian Jira (v8.3.4#803005)