[jira] [Created] (KAFKA-8695) Metrics UnderReplicated and UnderMinIsr are diverging when configuration is inconsistent

2019-07-22 Thread Alexandre Dupriez (JIRA)
Alexandre Dupriez created KAFKA-8695:


 Summary: Metrics UnderReplicated and UnderMinIsr are diverging 
when configuration is inconsistent
 Key: KAFKA-8695
 URL: https://issues.apache.org/jira/browse/KAFKA-8695
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.3.0, 2.1.1, 2.2.0, 2.1.0
Reporter: Alexandre Dupriez


As of now, Kafka allows the replication factor of a topic and 
"min.insync.replicas" to be set such that "min.insync.replicas" > the topic's 
replication factor.

As a consequence, the JMX beans
{code:java}
kafka.cluster:type=Partition,name=UnderReplicated{code}
and 
{code:java}
kafka.cluster:type=Partition,name=UnderMinIsr{code}
can report diverging views of the replication state of a topic. The former can 
report no under-replicated partitions, while the latter reports partitions under 
the minimum number of in-sync replicas.

 

Even worse, consumption from topics which exhibit this behaviour seems to fail, 
with the Kafka broker throwing a NotEnoughReplicasException.

 

 
{code:java}
[2019-07-22 10:44:29,913] ERROR [ReplicaManager broker=0] Error processing 
append operation on partition __consumer_offsets-0 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.NotEnoughReplicasException: The size of the 
current ISR Set(0) is insufficient to satisfy the min.isr requirement of 2 for 
partition __consumer_offsets-0{code}
 

 

In order to avoid this scenario, one possibility would be to check the values 
of "min.insync.replicas" and "default.replication.factor" when the broker 
starts, as well as "min.insync.replicas" and the replication factor given to a topic 
at creation time, and refuse to create the topic if those values are inconsistently 
set.
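For illustration, a minimal sketch of such a validation (a hypothetical helper, not the actual broker code) could look as follows:
{code:java}
// Hypothetical validation sketch, not the actual broker code: reject the configuration
// when min.insync.replicas exceeds the replication factor.
final class ReplicationConfigCheck {
    static void validate(int replicationFactor, int minInsyncReplicas) {
        if (minInsyncReplicas > replicationFactor) {
            throw new IllegalArgumentException("min.insync.replicas (" + minInsyncReplicas
                + ") must not exceed the replication factor (" + replicationFactor + ")");
        }
    }
}
{code}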

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (KAFKA-8752) Ensure plugin classes are instantiable when discovering plugins

2019-08-05 Thread Alexandre Dupriez (JIRA)
Alexandre Dupriez created KAFKA-8752:


 Summary: Ensure plugin classes are instantiable when discovering 
plugins
 Key: KAFKA-8752
 URL: https://issues.apache.org/jira/browse/KAFKA-8752
 Project: Kafka
  Issue Type: Bug
  Components: core
Reporter: Alexandre Dupriez
Assignee: Alexandre Dupriez


While running integration tests from the IntelliJ IDE, it appears plugins fail 
to load in {{DelegatingClassLoader.scanUrlsAndAddPlugins}}. The reason was, in 
this case, that the class 
{{org.apache.kafka.connect.connector.ConnectorReconfigurationTest$TestConnector}}
 could not be instantiated - which it is not intended to be.

The problem does not appear when running integration tests with Gradle, as the runtime 
closure is different from IntelliJ's - the latter includes test sources from module 
dependencies on the classpath.

While debugging this minor inconvenience, I could see that 
{{DelegatingClassLoader}} performs a sanity check on the plugin class to 
instantiate - as of now, it verifies the class is concrete. A quick fix for the 
problem highlighted above could be to add an extra condition on the Java modifiers 
of the class to ensure it is instantiable.
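For illustration, such a check could rely on the standard reflection API. The sketch below uses a hypothetical helper and is not the actual {{DelegatingClassLoader}} code:
{code:java}
import java.lang.reflect.Modifier;

// Illustrative sketch only (hypothetical helper): a plugin candidate is considered
// instantiable if it is a concrete class and, when nested, a static nested class.
final class PluginChecks {
    static boolean isInstantiable(Class<?> candidate) {
        int modifiers = candidate.getModifiers();
        return !candidate.isInterface()
            && !Modifier.isAbstract(modifiers)
            && (candidate.getEnclosingClass() == null || Modifier.isStatic(modifiers));
    }
}
{code}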



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (KAFKA-8695) Metrics UnderReplicated and UnderMinIsr are diverging when configuration is inconsistent

2019-08-13 Thread Alexandre Dupriez (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Dupriez resolved KAFKA-8695.
--
Resolution: Duplicate

De-duplicating in favor of 
[KAFKA-4680|http://issues.apache.org/jira/browse/KAFKA-4680].

> Metrics UnderReplicated and UnderMinIsr are diverging when configuration is 
> inconsistent
> 
>
> Key: KAFKA-8695
> URL: https://issues.apache.org/jira/browse/KAFKA-8695
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.0, 2.2.0, 2.1.1, 2.3.0
>Reporter: Alexandre Dupriez
>Assignee: Alexandre Dupriez
>Priority: Minor
>
> As of now, Kafka allows the replication factor of a topic and 
> "min.insync.replicas" to be set such that "min.insync.replicas" > the topic's 
> replication factor.
> As a consequence, the JMX beans
> {code:java}
> kafka.cluster:type=Partition,name=UnderReplicated{code}
> and 
> {code:java}
> kafka.cluster:type=Partition,name=UnderMinIsr{code}
> can report diverging views of the replication state of a topic. The former can 
> report no under-replicated partitions, while the latter reports partitions under 
> the minimum number of in-sync replicas.
> Even worse, consumption from topics which exhibit this behaviour seems to fail, 
> with the Kafka broker throwing a NotEnoughReplicasException. 
> {code:java}
> [2019-07-22 10:44:29,913] ERROR [ReplicaManager broker=0] Error processing 
> append operation on partition __consumer_offsets-0 
> (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.NotEnoughReplicasException: The size of the 
> current ISR Set(0) is insufficient to satisfy the min.isr requirement of 2 
> for partition __consumer_offsets-0 {code}
> In order to avoid this scenario, one possibility would be to check the values 
> of "min.insync.replicas" and "default.replication.factor" when the broker 
> starts, as well as "min.insync.replicas" and the replication factor given to a topic 
> at creation time, and refuse to create the topic if those values are inconsistently 
> set.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (KAFKA-8815) Kafka broker blocked on I/O primitive

2020-05-16 Thread Alexandre Dupriez (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Dupriez resolved KAFKA-8815.
--
Resolution: Not A Problem

System failure. Not related to Kafka.

> Kafka broker blocked on I/O primitive
> -
>
> Key: KAFKA-8815
> URL: https://issues.apache.org/jira/browse/KAFKA-8815
> Project: Kafka
>  Issue Type: Bug
>  Components: core, log
>Affects Versions: 1.1.1
>Reporter: Alexandre Dupriez
>Priority: Major
>
> This JIRA is for tracking a problem we ran into on a production cluster.
> *Scenario*
> Cluster of 15 brokers and an average ingress throughput of ~4 MB/s and egress 
> of ~4 MB/s per broker.
> Brokers are running on OpenJDK 8. They are configured with a heap size of 1 
> GB.
> There are around 1,000 partition replicas per broker. Load is evenly 
> balanced. Each broker instance is under fair CPU load, but not overloaded 
> (50-60%). G1 is used for garbage collection and doesn't exhibit any pressure, 
> with mostly short young GCs observed and a heap-after-GC usage of 70%.
> Replication factor is 3.
> *Symptom*
> One broker on the cluster suddenly became "unresponsive". Requests from other 
> brokers, Zookeeper and producers/consumers were failing with timeouts. The 
> Kafka process, however, was still alive and doing some background work 
> (truncating logs and rolling segments). This lasted for hours. At some point, 
> several thread dumps were taken at a few minutes' interval. Most of the threads 
> were "blocked". A deadlock was ruled out. The most suspicious stack is the 
> following: 
> {code:java}
> Thread 7801: (state = BLOCKED)
>  - sun.nio.ch.FileChannelImpl.write(java.nio.ByteBuffer) @bci=25, line=202 
> (Compiled frame)
>  - 
> org.apache.kafka.common.record.MemoryRecords.writeFullyTo(java.nio.channels.GatheringByteChannel)
>  @bci=24, line=93 (Compiled frame)
>  - 
> org.apache.kafka.common.record.FileRecords.append(org.apache.kafka.common.record.MemoryRecords)
>  @bci=5, line=152 (Compiled frame)
>  - kafka.log.LogSegment.append(long, long, long, long, 
> org.apache.kafka.common.record.MemoryRecords) @bci=82, line=136 (Compiled 
> frame)
>  - kafka.log.Log.$anonfun$append$2(kafka.log.Log, 
> org.apache.kafka.common.record.MemoryRecords, boolean, boolean, int, 
> java.lang.Object) @bci=1080, line=757 (Compiled frame)
>  - kafka.log.Log$$Lambda$614.apply() @bci=24 (Compiled frame)
>  - kafka.log.Log.maybeHandleIOException(scala.Function0, scala.Function0) 
> @bci=1, line=1696 (Compiled frame)
>  - kafka.log.Log.append(org.apache.kafka.common.record.MemoryRecords, 
> boolean, boolean, int) @bci=29, line=642 (Compiled frame)
>  - kafka.log.Log.appendAsLeader(org.apache.kafka.common.record.MemoryRecords, 
> int, boolean) @bci=5, line=612 (Compiled frame)
>  - 
> kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(kafka.cluster.Partition,
>  org.apache.kafka.common.record.MemoryRecords, boolean, int) @bci=148, 
> line=609 (Compiled frame)
>  - kafka.cluster.Partition$$Lambda$837.apply() @bci=16 (Compiled frame)
>  - kafka.utils.CoreUtils$.inLock(java.util.concurrent.locks.Lock, 
> scala.Function0) @bci=7, line=250 (Compiled frame)
>  - 
> kafka.utils.CoreUtils$.inReadLock(java.util.concurrent.locks.ReadWriteLock, 
> scala.Function0) @bci=8, line=256 (Compiled frame)
>  - 
> kafka.cluster.Partition.appendRecordsToLeader(org.apache.kafka.common.record.MemoryRecords,
>  boolean, int) @bci=16, line=597 (Compiled frame)
>  - 
> kafka.server.ReplicaManager.$anonfun$appendToLocalLog$2(kafka.server.ReplicaManager,
>  boolean, boolean, short, scala.Tuple2) @bci=295, line=739 (Compiled frame)
>  - kafka.server.ReplicaManager$$Lambda$836.apply(java.lang.Object) @bci=20 
> (Compiled frame)
>  - scala.collection.TraversableLike.$anonfun$map$1(scala.Function1, 
> scala.collection.mutable.Builder, java.lang.Object) @bci=3, line=234 
> (Compiled frame)
>  - scala.collection.TraversableLike$$Lambda$14.apply(java.lang.Object) @bci=9 
> (Compiled frame)
>  - scala.collection.mutable.HashMap.$anonfun$foreach$1(scala.Function1, 
> scala.collection.mutable.DefaultEntry) @bci=16, line=138 (Compiled frame)
>  - scala.collection.mutable.HashMap$$Lambda$31.apply(java.lang.Object) @bci=8 
> (Compiled frame)
>  - scala.collection.mutable.HashTable.foreachEntry(scala.Function1) @bci=39, 
> line=236 (Compiled frame)
>  - 
> scala.collection.mutable.HashTable.foreachEntry$(scala.collection.mutable.HashTable,
>  scala.Function1) @bci=2, line=229 (Compiled frame)
>  - scala.collection.mutable.HashMap.foreachEntry(scala.Function1) @bci=2, 
> line=40 (Compiled frame)
>  - scala.collection.mutable.HashMap.foreach(scala.Function1) @bci=7, line=138 
> (Compiled frame)
>  - scala.collection.TraversableLike.map(scala.Function1, 
> scala.collection.generic.CanBuildFrom) @bci=14, line=234 (Compiled fra

[jira] [Resolved] (KAFKA-8916) Unreliable kafka-reassign-partitions.sh affecting performance

2020-05-16 Thread Alexandre Dupriez (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Dupriez resolved KAFKA-8916.
--
Resolution: Invalid

Closing this as there is no bug or development required.

Please kindly reach out to the user mailing list for this type of question: 
us...@kafka.apache.org. More information on how to engage the Kafka community is 
available at https://kafka.apache.org/contact.html.


> Unreliable kafka-reassign-partitions.sh affecting performance
> -
>
> Key: KAFKA-8916
> URL: https://issues.apache.org/jira/browse/KAFKA-8916
> Project: Kafka
>  Issue Type: Task
>  Components: admin, config
>Affects Versions: 2.1.1
> Environment: CentOS 7
>Reporter: VinayKumar
>Priority: Major
>  Labels: performance
>
> Currently I have a 3-node Kafka cluster, and I want to add 2 more nodes to make 
> it a 5-node cluster.
> After adding the nodes to the cluster, I need all the topic partitions to be 
> evenly distributed across all 5 nodes.
> In the past, when I ran kafka-reassign-partitions.sh & 
> kafka-preferred-replica-election.sh, it ran for a very long time, hung & made 
> the cluster unstable. So I'm afraid to use this method.
> Can you please suggest the best & foolproof way to assign partitions among 
> all the cluster nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-6959) Any impact we foresee if we upgrade Linux version or move to VM instead of physical Linux server

2020-05-16 Thread Alexandre Dupriez (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Dupriez resolved KAFKA-6959.
--
Resolution: Fixed

> Any impact we foresee if we upgrade Linux version or move to VM instead of 
> physical Linux server
> 
>
> Key: KAFKA-6959
> URL: https://issues.apache.org/jira/browse/KAFKA-6959
> Project: Kafka
>  Issue Type: Task
>  Components: admin
>Affects Versions: 0.11.0.2
> Environment: Prod
>Reporter: Gene Yi
>Priority: Trivial
>  Labels: patch, performance, security
>
> As we know, there is the recent issue with the Linux Meltdown and Spectre 
> vulnerabilities: all Linux servers need to deploy the patch and the OS version 
> needs to be at least 6.9. We want to know the impact on Kafka: is there any side 
> effect if we directly upgrade the OS to 7.0? Also, is there any limitation if we 
> deploy Kafka on VMs instead of physical servers?
> The Kafka version we currently use is 0.11.0.2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9554) Define the SPI for Tiered Storage framework

2020-02-14 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-9554:


 Summary: Define the SPI for Tiered Storage framework
 Key: KAFKA-9554
 URL: https://issues.apache.org/jira/browse/KAFKA-9554
 Project: Kafka
  Issue Type: Sub-task
  Components: clients, core
Reporter: Alexandre Dupriez
Assignee: Alexandre Dupriez


The goal of this task is to define the SPI (service provider interface) which 
will be used by vendors to implement plug-ins that communicate with specific 
storage systems.
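Purely as a hypothetical illustration of the shape such an SPI could take (the names below are invented; the actual interfaces defined by this task may differ entirely):
{code:java}
import java.io.InputStream;

// Invented names for illustration only; not the interfaces under review in this task.
interface RemoteSegmentStore {
    void copySegment(String topic, int partition, long baseOffset, InputStream segmentData) throws Exception;

    InputStream fetchSegment(String topic, int partition, long baseOffset, long startPosition) throws Exception;

    void deleteSegment(String topic, int partition, long baseOffset) throws Exception;
}
{code}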



Done means:
 * Package with interfaces and key objects available and published for review.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9555) Topic-based implementation for the RLMM

2020-02-14 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-9555:


 Summary: Topic-based implementation for the RLMM
 Key: KAFKA-9555
 URL: https://issues.apache.org/jira/browse/KAFKA-9555
 Project: Kafka
  Issue Type: Sub-task
  Components: core
Reporter: Alexandre Dupriez






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9564) Integration Tests for Tiered Storage

2020-02-17 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-9564:


 Summary: Integration Tests for Tiered Storage
 Key: KAFKA-9564
 URL: https://issues.apache.org/jira/browse/KAFKA-9564
 Project: Kafka
  Issue Type: Sub-task
Reporter: Alexandre Dupriez
Assignee: Alexandre Dupriez






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9565) Implementation of Tiered Storage SPI to integrate with S3

2020-02-18 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-9565:


 Summary: Implementation of Tiered Storage SPI to integrate with S3
 Key: KAFKA-9565
 URL: https://issues.apache.org/jira/browse/KAFKA-9565
 Project: Kafka
  Issue Type: Sub-task
Reporter: Alexandre Dupriez






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-7958) Transactions are broken with kubernetes hosted brokers

2020-03-20 Thread Alexandre Dupriez (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Dupriez resolved KAFKA-7958.
--
Fix Version/s: 2.1.1
   Resolution: Fixed

> Transactions are broken with kubernetes hosted brokers
> --
>
> Key: KAFKA-7958
> URL: https://issues.apache.org/jira/browse/KAFKA-7958
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: cp-kafka 2.1.1-1, kafka-streams 2.1.1
>Reporter: Thomas Dickinson
>Priority: Major
> Fix For: 2.1.1
>
>
> After a rolling re-start in a kubernetes-like environment, brokers may change 
> IP address.  From our logs it seems that the transaction manager in the 
> brokers never re-resolves the DNS name of other brokers, keeping stale pod 
> IPs.  Thus transactions stop working.  
> ??[2019-02-20 02:20:20,085] WARN [TransactionCoordinator id=1001] Connection 
> to node 0 
> (khaki-joey-kafka-0.khaki-joey-kafka-headless.hyperspace-dev/[10.233.124.181:9092|http://10.233.124.181:9092/])
>  could not be established. Broker may not be available. 
> (org.apache.kafka.clients.NetworkClient)??
> ??[2019-02-20 02:20:57,205] WARN [TransactionCoordinator id=1001] Connection 
> to node 1 
> (khaki-joey-kafka-1.khaki-joey-kafka-headless.hyperspace-dev/[10.233.122.67:9092|http://10.233.122.67:9092/])
>  could not be established. Broker may not be available. 
> (org.apache.kafka.clients.NetworkClient)??
> This is from the log from broker 1001 which was restarted first, followed by 
> 1 and then 0.  The log entries are from the day after the rolling restart.
> I note a similar issue was fixed for clients 2.1.1  
> https://issues.apache.org/jira/browse/KAFKA-7755.  We are using streams lib 
> 2.1.1
> *Update* We are now testing with Kafka 2.1.1 (docker cp-kafka 5.1.2-1) and it 
> looks like this may be resolved.  Would be good to get confirmation though.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9549) Local storage implementations for RSM which can be used in tests

2020-04-27 Thread Alexandre Dupriez (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Dupriez resolved KAFKA-9549.
--
Resolution: Fixed

> Local storage implementations for RSM which can be used in tests
> 
>
> Key: KAFKA-9549
> URL: https://issues.apache.org/jira/browse/KAFKA-9549
> Project: Kafka
>  Issue Type: Sub-task
>  Components: system tests
>Reporter: Satish Duggana
>Assignee: Alexandre Dupriez
>Priority: Major
>
> The goal of this task is to provide a straightforward file-system-based 
> implementation of the {{RemoteStorageManager}} defined as part of the SPI for 
> Tiered Storage.
> It is intended to be used in single-host integration tests where the remote 
> storage is, or can be, exercised. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-14190) Corruption of Topic IDs with pre-2.8.0 admin clients

2022-08-30 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-14190:
-

 Summary: Corruption of Topic IDs with pre-2.8.0 admin clients
 Key: KAFKA-14190
 URL: https://issues.apache.org/jira/browse/KAFKA-14190
 Project: Kafka
  Issue Type: Bug
  Components: admin, core, zkclient
Affects Versions: 3.2.1, 3.1.1, 3.2.0, 3.0.1, 3.0.0, 2.8.1, 3.1.0
Reporter: Alexandre Dupriez


h4. Scope

The problem reported below has been verified to occur with Zookeeper 
controllers. It has not been attempted with KRaft controllers, although it is 
unlikely to be reproduced in KRaft mode given the nature of the issue and the 
clients involved.
h4. Problem Description

There is a loss of topic IDs when an AdminClient of version < 2.8.0 is used to 
increase the number of partitions of a topic on a cluster with version >= 
2.8.0. This results in the controller re-creating topic IDs upon restart, 
eventually conflicting with the topic ID stored in the brokers' partition.metadata 
files in the partition directories of the impacted topic. This leads to an 
availability loss of the partitions, which do not accept leadership / follower-ship 
when the topic ID indicated by a LeaderAndIsr request differs from their own 
locally cached ID.

One mitigation post-corruption is to substitute the stale topic ID in the 
partition.metadata files with the new topic ID referenced by the controller, or 
alternatively, delete the partition.metadata file altogether. 
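For illustration only, removing the stale file for one partition could be done as in the sketch below; the log directory and partition are hypothetical, and each affected topic-partition would need the same treatment:
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical mitigation helper: remove the stale partition.metadata for one partition
// so the broker can re-create it with the topic ID referenced by the controller.
final class StaleTopicIdCleanup {
    public static void main(String[] args) throws IOException {
        // Illustrative log directory and partition name; adjust to the impacted topic.
        Files.deleteIfExists(Paths.get("/var/kafka-logs", "myTopic-0", "partition.metadata"));
    }
}
{code}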
h4. Steps to reproduce

1. Set up and launch a two-node Kafka cluster in Zookeeper mode.

2. Create a topic e.g. via {{kafka-topics.sh}}

 
{noformat}
./bin/kafka-topics.sh --bootstrap-server :9092 --create --topic myTopic 
--partitions 2 --replication-factor 2{noformat}
3. Capture the topic ID using a 2.8.0+ client.

 
{noformat}
./kafka/bin/kafka-topics.sh --bootstrap-server :9092 --topic myTopic --describe

Topic: myTopic TopicId: jKTRaM_TSNqocJeQI2aYOQ PartitionCount: 2 
ReplicationFactor: 2 Configs: segment.bytes=1073741824
Topic: myTopic Partition: 0 Leader: 0 Replicas: 1,0 Isr: 0,1
Topic: myTopic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1{noformat}
 

4. Restart one of the brokers. This will make each broker create the 
{{partition.metadata}} files in the partition directories, since it will already 
have loaded the {{Log}} instance in memory.

 

5. Using a pre-2.8.0 client library, run the following command.

 
{noformat}
./kafka/bin/kafka-topics.sh --zookeeper :2181 --alter --topic myTopic 
--partitions 3{noformat}
 

6. Using a 2.8.0+ client library, describe the topic via Zookeeper and notice 
the absence of topic ID from the output, where it is otherwise expected.

 
{noformat}
./kafka/bin/kafka-topics.sh --zookeeper :2181 --describe --topic myTopic

Topic: myTopic PartitionCount: 3 ReplicationFactor: 2 Configs: 
Topic: myTopic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 0,1
Topic: myTopic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1
Topic: myTopic Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0{noformat}
 

7. Using a 2.8.0+ client library, describe the topic via a broker endpoint and 
notice the topic ID changed.

 
{noformat}
./kafka/bin/kafka-topics.sh --bootstrap-server :9093 --describe --topic myTopic

Topic: myTopic TopicId: nI-JQtPwQwGiylMfm8k13w PartitionCount: 3 
ReplicationFactor: 2 Configs: segment.bytes=1073741824
Topic: myTopic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0
Topic: myTopic Partition: 1 Leader: 1 Replicas: 0,1 Isr: 1,0
Topic: myTopic Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0{noformat}
 

8. Restart the controller.

9. Check the state-change.log file on the controller broker. The following type 
of logs will appear.

 
{noformat}
[2022-08-25 17:44:05,308] ERROR [Broker id=0] Topic Id in memory: 
jKTRaM_TSNqocJeQI2aYOQ does not match the topic Id for partition myTopic-0 
provided in the request: nI-JQtPwQwGiylMfm8k13w. (state.change.logger){noformat}
 

10. Restart the other broker.

11. Describe the topic via the broker endpoint or Zookeeper with a 2.8.0+ 
client library

 
{noformat}
./kafka/bin/kafka-topics.sh --zookeeper :2181 --describe --topic myTopic

Topic: myTopic TopicId: nI-JQtPwQwGiylMfm8k13w PartitionCount: 3 
ReplicationFactor: 2 Configs: 
Topic: myTopic Partition: 0 Leader: 0 Replicas: 1,0 Isr: 0
Topic: myTopic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0
Topic: myTopic Partition: 2 Leader: 0 Replicas: 1,0 Isr: 0{noformat}
 

Notice the abnormal state the topic is in: the ISR is reduced to a single broker, 
which is claimed to be the leader by the controller (here, broker 0). The 
controller believes 0 is the leader because it does not handle the error 
response from peer brokers when sending the requests for them to become a 
leader or follower of a partition.

12. Verify produce is unavailable.

 
{noformat}
./kafka/bin/kafka-console-producer.sh --bootstrap-server :9092 --topic myTopic

[2022-08-25 17:52:59,962] ERROR Error when sending message to topic myTopic 

[jira] [Resolved] (KAFKA-8752) Ensure plugin classes are instantiable when discovering plugins

2023-02-20 Thread Alexandre Dupriez (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Dupriez resolved KAFKA-8752.
--
Resolution: Not A Problem

> Ensure plugin classes are instantiable when discovering plugins
> ---
>
> Key: KAFKA-8752
> URL: https://issues.apache.org/jira/browse/KAFKA-8752
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Reporter: Alexandre Dupriez
>Assignee: Alexandre Dupriez
>Priority: Minor
> Attachments: stacktrace.log
>
>
> While running integration tests from the IntelliJ IDE, it appears plugins 
> fail to load in {{DelegatingClassLoader.scanUrlsAndAddPlugins}}. The reason 
> was, in this case, that the class 
> {{org.apache.kafka.connect.connector.ConnectorReconfigurationTest$TestConnector}}
>  could not be instantiated - which it is not intended to be.
> The problem does not appear when running integration tests with Gradle as the 
> runtime closure is different from IntelliJ - which includes test sources from 
> module dependencies on the classpath.
> While debugging this minor inconvenience, I could see that 
> {{DelegatingClassLoader}} performs a sanity check on the plugin class to 
> instantiate - as of now, it verifies the class is concrete. A quick fix for 
> the problem highlighted above could be to add an extra condition on the Java 
> modifiers of the class to ensure it is instantiable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15486) Include NIO exceptions as I/O exceptions to be part of disk failure handling

2023-09-22 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-15486:
-

 Summary: Include NIO exceptions as I/O exceptions to be part of 
disk failure handling
 Key: KAFKA-15486
 URL: https://issues.apache.org/jira/browse/KAFKA-15486
 Project: Kafka
  Issue Type: Improvement
  Components: core, jbod
Reporter: Alexandre Dupriez


Currently, Apache Kafka offers the ability to detect and capture I/O errors 
when accessing the file system via the standard {{IOException}} from the JDK. 
There are cases, however, where I/O errors are only reported via exceptions such 
as {{BufferOverflowException}}, without an associated {{IOException}} on the 
produce or read path, so that the data volume is not detected as unhealthy and 
not included in the list of offline directories.

Specifically, we faced the following scenario on a broker:
 * The data volume hosting a log directory became saturated.
 * As expected, {{IOException}} errors were generated on the read/write path.
 * The log directory was set as offline and since it was the only log directory 
configured on the broker, Kafka automatically shut down.
 * Additional space was added to the data volume.
 * Kafka was then restarted.
 * No more {{IOException}} occurred; however, {{BufferOverflowException}} *[*]* 
was raised while trying to delete log segments in order to honour the retention 
settings of a topic. The log directory was not moved offline and the 
exception kept re-occurring indefinitely.

The retention settings were therefore not applied in this case. The mitigation 
consisted of restarting Kafka.

It may be worth considering adding {{BufferOverflowException}} and 
{{BufferUnderflowException}} (and any other related exception from the JDK NIO 
library which surfaces an I/O error) to the existing {{IOException}} as a proxy 
for storage I/O failure. That said, there may be known unintended consequences in 
doing so, which could be the reason they were not added already; or the impact may 
be too marginal to justify modifying the main I/O failure handling path and 
exposing it to such unknown unintended consequences.
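A minimal sketch of the idea, with hypothetical method names standing in for the actual write and failure-handling paths:
{code:java}
import java.io.IOException;
import java.nio.BufferOverflowException;
import java.nio.BufferUnderflowException;

// Illustrative only: route selected NIO buffer exceptions through the same storage
// failure handling as IOException. Names are hypothetical placeholders.
final class StorageWriteGuard {

    @FunctionalInterface
    interface StorageWrite {
        void run() throws IOException;
    }

    void appendWithFailureHandling(String logDir, StorageWrite write) {
        try {
            write.run();
        } catch (IOException | BufferOverflowException | BufferUnderflowException e) {
            markLogDirOffline(logDir, e);
        }
    }

    private void markLogDirOffline(String logDir, Exception cause) {
        // Placeholder for the broker's log-dir failure handling.
        System.err.println("Marking log directory offline: " + logDir + " due to " + cause);
    }
}
{code}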

*[*]*
{code:java}
java.nio.BufferOverflowException
    at java.base/java.nio.Buffer.nextPutIndex(Buffer.java:674)
    at java.base/java.nio.DirectByteBuffer.putLong(DirectByteBuffer.java:882)
    at kafka.log.TimeIndex.$anonfun$maybeAppend$1(TimeIndex.scala:134)
    at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:114)
    at kafka.log.LogSegment.onBecomeInactiveSegment(LogSegment.scala:506)
    at kafka.log.Log.$anonfun$roll$8(Log.scala:2066)
    at kafka.log.Log.$anonfun$roll$8$adapted(Log.scala:2066)
    at scala.Option.foreach(Option.scala:437)
    at kafka.log.Log.$anonfun$roll$2(Log.scala:2066)
    at kafka.log.Log.roll(Log.scala:2482)
    at kafka.log.Log.maybeRoll(Log.scala:2017)
    at kafka.log.Log.append(Log.scala:1292)
    at kafka.log.Log.appendAsFollower(Log.scala:1155)
    at kafka.cluster.Partition.doAppendRecordsToFollowerOrFutureReplica(Partition.scala:1023)
    at kafka.cluster.Partition.appendRecordsToFollowerOrFutureReplica(Partition.scala:1030)
    at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:178)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7(AbstractFetcherThread.scala:356)
    at scala.Option.foreach(Option.scala:437)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6(AbstractFetcherThread.scala:345)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6$adapted(AbstractFetcherThread.scala:344)
    at kafka.utils.Implicits$MapExtensionMethods$.$anonfun$forKeyValue$1(Implicits.scala:62)
    at scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry(JavaCollectionWrappers.scala:359)
    at scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry$(JavaCollectionWrappers.scala:355)
    at scala.collection.convert.JavaCollectionWrappers$AbstractJMapWrapper.foreachEntry(JavaCollectionWrappers.scala:309)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:344)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:141)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:140)
    at scala.Option.foreach(Option.scala:437)
    at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:140)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:123)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
{code}
 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint

2023-09-23 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-15490:
-

 Summary: Invalid path provided to the log failure channel upon I/O 
error when writing broker metadata checkpoint
 Key: KAFKA-15490
 URL: https://issues.apache.org/jira/browse/KAFKA-15490
 Project: Kafka
  Issue Type: Bug
  Components: core
Reporter: Alexandre Dupriez


There is a small bug/typo in the handling of I/O errors when writing the broker 
metadata checkpoint in {{KafkaServer}}. The path provided to the log dir 
failure channel is the full path of the checkpoint file, whereas only the log 
directory is expected 
([source|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/server/KafkaServer.scala#L958C8-L961C8]).

 
{code:java}
case e: IOException =>
          val dirPath = checkpoint.file.getAbsolutePath
          logDirFailureChannel.maybeAddOfflineLogDir(dirPath, s"Error while 
writing meta.properties to $dirPath", e){code}
 

As a result, after an {{IOException}} is captured and enqueued in the log dir 
failure channel:

{code:java}
[2023-09-22 17:07:32,052] ERROR Error while writing meta.properties to /meta.properties (kafka.server.LogDirFailureChannel)
java.io.IOException{code}

 

The log dir failure handler cannot look up the log directory:

{code:java}
[2023-09-22 17:07:32,053] ERROR [LogDirFailureHandler]: Error due to (kafka.server.ReplicaManager$LogDirFailureHandler)
org.apache.kafka.common.errors.LogDirNotFoundException: Log dir /meta.properties is not found in the config.{code}
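For illustration (in Java, whereas the actual code in {{KafkaServer}} is Scala), the intended behaviour would be to report the parent log directory rather than the checkpoint file itself; a minimal sketch with hypothetical paths:
{code:java}
import java.io.File;

// Illustrative only: the failure channel expects the log directory, not the
// checkpoint file itself, so the parent directory should be reported instead.
final class CheckpointFailurePath {
    static String offlineDirFor(File checkpointFile) {
        return checkpointFile.getParentFile().getAbsolutePath();
    }

    public static void main(String[] args) {
        // Hypothetical path: prints "/var/kafka-logs" rather than ".../meta.properties".
        System.out.println(offlineDirFor(new File("/var/kafka-logs/meta.properties")));
    }
}
{code}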

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15678) [Tiered Storage] Stall remote reads with long-spanning transactions

2023-10-24 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-15678:
-

 Summary: [Tiered Storage] Stall remote reads with long-spanning 
transactions
 Key: KAFKA-15678
 URL: https://issues.apache.org/jira/browse/KAFKA-15678
 Project: Kafka
  Issue Type: Bug
  Components: Tiered-Storage
Affects Versions: 3.6.0
Reporter: Alexandre Dupriez


Hi team,

I am facing an issue on the remote data path for uncommitted reads.

As mentioned in [the original 
PR|https://github.com/apache/kafka/pull/13535#discussion_r1166887367], if a 
transaction spans a long sequence of segments, the time taken to retrieve 
the producer snapshots from the remote storage can, in the worst case, become 
prohibitive and block the reads if it consistently exceeds the deadline of fetch 
requests ({{fetch.max.wait.ms}}).

Essentially, the method used to compute the uncommitted records to return has 
an asymptotic complexity proportional to the number of segments in the log. 
That is not a problem with local storage, since the constant factor to traverse 
the files is small enough, but it is with remote storage, which 
exhibits a higher read latency. An aggravating factor was the lock contention in 
the remote index cache, which was then mitigated by KAFKA-15084. But 
unfortunately, despite the improvements observed without said contention, the 
algorithmic complexity of the current method used to compute uncommitted 
records can always defeat any optimisation made on the remote read path.
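As a purely hypothetical back-of-the-envelope illustration (the numbers below are invented, not measured):
{code:java}
// Hypothetical illustration, not measured data: a per-segment remote fetch makes the
// cost of resolving aborted transactions grow with the number of segments spanned.
final class UncommittedReadCost {
    public static void main(String[] args) {
        int segmentsSpanned = 100;   // assumed segments covered by one long-running transaction
        long remoteFetchMs = 200;    // assumed latency per producer-snapshot fetch
        long fetchMaxWaitMs = 500;   // assumed fetch.max.wait.ms

        long worstCaseMs = segmentsSpanned * remoteFetchMs;
        System.out.printf("worst case ~%d ms vs fetch deadline %d ms%n", worstCaseMs, fetchMaxWaitMs);
    }
}
{code}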

Maybe we could start thinking (if not already) about a different construct 
which would reduce that complexity to O(1) - i.e. make the computation 
independent of the number of segments and of the span of the transactions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14777) Add support of topic id for OffsetCommitRequests in CommitRequestManager

2023-03-05 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-14777:
-

 Summary: Add support of topic id for OffsetCommitRequests in 
CommitRequestManager
 Key: KAFKA-14777
 URL: https://issues.apache.org/jira/browse/KAFKA-14777
 Project: Kafka
  Issue Type: Sub-task
Reporter: Alexandre Dupriez


Topic IDs have been introduced to the {{OffsetCommitRequest}} in KAFKA-14690. 
The consumer coordinator now generates these requests with topic ids when all 
topics present in the request have a resolved id.

This change was not added to the commit request manager to limit the scope of 
[PR-13240|https://github.com/apache/kafka/pull/13240]. The purpose of this ticket 
is to extend the support of topic ids to this new component.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14779) Add ACL Authorizer integration test for authorized OffsetCommits with an unknown topic

2023-03-06 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-14779:
-

 Summary: Add ACL Authorizer integration test for authorized 
OffsetCommits with an unknown topic
 Key: KAFKA-14779
 URL: https://issues.apache.org/jira/browse/KAFKA-14779
 Project: Kafka
  Issue Type: Sub-task
Reporter: Alexandre Dupriez


Discovered as part of [PR-13240|https://github.com/apache/kafka/pull/13240], 
it seems the use case where a group and topic have the necessary ACLs to allow 
offsets for that topic and consumer group to be committed, but the topic is 
unknown to the broker (either by name or id), is not covered. The purpose of 
this ticket is to add this coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14780) Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay deterministic

2023-03-06 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-14780:
-

 Summary: Make 
RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay deterministic
 Key: KAFKA-14780
 URL: https://issues.apache.org/jira/browse/KAFKA-14780
 Project: Kafka
  Issue Type: Test
Reporter: Alexandre Dupriez


The test `RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay` relies 
on the actual system clock, which makes it frequently fail on my poor IntelliJ 
setup.

 

The `RefreshingHttpsJwks` component creates and uses a scheduled executor 
service. We could expose the scheduling mechanism to be able to mock its 
behaviour. One way to do this could be to use the `KafkaScheduler`, which has a 
`MockScheduler` implementation relying on `MockTime` instead of the real 
time clock.
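As a generic illustration of the idea (not the actual `KafkaScheduler`/`MockScheduler` API), a manually advanced clock lets a test trigger the secondary refresh deterministically instead of sleeping:
{code:java}
// Generic sketch, not the actual KafkaScheduler/MockScheduler API: time moves only
// when the test tells it to, so the refresh schedule becomes deterministic.
final class ManualClock {
    private long nowMs = 0L;

    long milliseconds() { return nowMs; }

    void advance(long deltaMs) { nowMs += deltaMs; }
}

// Example usage in a test (hypothetical component and hooks):
//   ManualClock clock = new ManualClock();
//   RefreshingComponent component = new RefreshingComponent(clock);
//   clock.advance(refreshDelayMs);        // simulate the elapsed delay
//   component.runPendingRefreshes();      // trigger the scheduled work explicitly
{code}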



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14793) Propagate topic ids to the group coordinator

2023-03-08 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-14793:
-

 Summary: Propagate topic ids to the group coordinator
 Key: KAFKA-14793
 URL: https://issues.apache.org/jira/browse/KAFKA-14793
 Project: Kafka
  Issue Type: Sub-task
Reporter: Alexandre Dupriez


KAFKA-14690 introduces topic ids in the OffsetCommit API at the request layer. 
Propagation of topic ids within the group coordinator has been left out of 
scope. It remains to be decided whether topic ids are re-mapped internally in the 
group coordinator or whether the group coordinator starts to rely on {{TopicIdPartition}}.

Note that with KAFKA-14690, the offset commit response data built by the 
coordinator includes topic names only, and topic ids need to be injected 
afterwards outside of the coordinator before serializing the response.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14806) Add connection timeout in PlaintextSender used by SelectorTests

2023-03-14 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-14806:
-

 Summary: Add connection timeout in PlaintextSender used by 
SelectorTests
 Key: KAFKA-14806
 URL: https://issues.apache.org/jira/browse/KAFKA-14806
 Project: Kafka
  Issue Type: Test
Reporter: Alexandre Dupriez


Tests in `SelectorTest` can fail due to spurious connection timeouts. One 
example can be found in [this 
build|https://github.com/apache/kafka/pull/13378/checks?check_run_id=11970595528]
 where the client connection that the `PlaintextSender` tried to open could not be 
established before the test timed out.

It may be worth enforcing a connection timeout and retries if this can add to the 
selector tests' resiliency. Note that `PlaintextSender` is only used by 
`SelectorTest`, so the scope of the change would remain local.
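For illustration, a plain-socket connect with an explicit timeout and bounded retries could look like the sketch below (a hypothetical helper, not the actual `PlaintextSender` code):
{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Illustrative only: connect with an explicit timeout and a bounded number of retries,
// instead of relying on the default (potentially unbounded) connect behaviour.
final class RetryingConnector {
    static Socket connect(String host, int port, int timeoutMs, int maxAttempts) throws IOException {
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Socket socket = new Socket();
            try {
                socket.connect(new InetSocketAddress(host, port), timeoutMs);
                return socket;
            } catch (IOException e) {
                lastFailure = e;
                socket.close();
            }
        }
        throw lastFailure != null ? lastFailure : new IOException("no connection attempt was made");
    }
}
{code}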



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker

2023-03-24 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-14845:
-

 Summary: Broker ZNode creation can fail due to a session ID 
unknown to the broker
 Key: KAFKA-14845
 URL: https://issues.apache.org/jira/browse/KAFKA-14845
 Project: Kafka
  Issue Type: Bug
Reporter: Alexandre Dupriez
 Attachments: broker-registration.drawio

Our production environment faced a use case where the registration of a broker 
failed due to the presence of a "conflicting" broker znode in Zookeeper. This 
case bears similarity to the one fixed by KAFKA-6584 and is induced by the 
Zookeeper bug (or feature) tracked in ZOOKEEPER-2985, still open as of today.

A network partition disturbed the communication channels between the Kafka and 
Zookeeper clusters for about 20% of the brokers in the cluster. One of these 
brokers was not able to re-register with Zookeeper and was excluded from the 
cluster until it was restarted. Broker logs show the failed registration due to 
a "conflicting" znode write which, in this case, is not covered by KAFKA-6584. 
The broker did not restart and was not unhealthy. In the following logs, the 
broker IP is 1.2.3.4.

The sequence of logs on the broker is as follows.

First, a connection is established with the Zookeeper node 3.

 
{code:java}
[2023-03-05 16:01:55,342] INFO Socket connection established, initiating 
session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 
(org.apache.zookeeper.ClientCnxn)
[2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, 
L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] 
(org.apache.zookeeper.ClientCnxnSocketNetty){code}
 

An existing Zookeeper session was expired, and upon reconnection, the Zookeeper 
state change handler was invoked. The creation of the ephemeral znode 
/brokers/ids/18 started on the controller thread.

 
{code:java}
[2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) 
(kafka.zk.KafkaZkClient){code}
 

The client "session" timed out after 6 seconds. Note the session is 0x0 and the 
absence of "{_}Session establishment complete{_}" log: the broker appears to 
have never received or processed the response from the Zookeeper node.

 
{code:java}
[2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from 
server in 6000ms for sessionid 0x0, closing socket connection and attempting 
reconnect (org.apache.zookeeper.ClientCnxn)
[2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, 
L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] 
(org.apache.zookeeper.ClientCnxnSocketNetty){code}
 

Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client 
started waiting on a new connection notification.

 
{code:java}
[2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until 
connected. (kafka.zookeeper.ZooKeeperClient){code}
 

A new connection was created with the Zookeeper node 1. Note that a valid (new) 
session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time.

 
{code:java}
[2023-03-05 16:02:02,037] INFO Socket connection established, initiating 
session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 
(org.apache.zookeeper.ClientCnxn)
[2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, 
L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] 
(org.apache.zookeeper.ClientCnxnSocketNetty)
[2023-03-05 16:02:03,054] INFO Session establishment complete on server 
zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 
(org.apache.zookeeper.ClientCnxn){code}
 

The Kafka ZK client is notified of the connection.

 
{code:java}
[2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. 
(kafka.zookeeper.ZooKeeperClient){code}
 

The broker sends the request to create the znode {{/brokers/ids/18}} which 
already exists. The error path implemented for KAFKA-6584 is then followed. 
However, in this case, the session owning the ephemeral node 
{{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last 
active Zookeeper session which the broker has recorded. And it is also 
different from the current session {{0x1006c6e0b830001}} 
({{{}72176813933264897{}}}), hence the recreation of the broker znode is not 
attempted.

 
{code:java}
[2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at 
/brokers/ids/18, node already exists and owner '216172783240153793' does not 
match current session '72176813933264897' 
(kafka.zk.KafkaZkClient$CheckedEphemeral)
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
        at 
kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821)
        at 
kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759)
        at 
kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726)
        at kafka.zk.Kaf

[jira] [Created] (KAFKA-14852) Propagate Topic Ids to the Group Coordinator for Offset Fetch

2023-03-27 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-14852:
-

 Summary: Propagate Topic Ids to the Group Coordinator for Offset 
Fetch
 Key: KAFKA-14852
 URL: https://issues.apache.org/jira/browse/KAFKA-14852
 Project: Kafka
  Issue Type: Sub-task
Reporter: Alexandre Dupriez
 Fix For: 3.5.0


This task is the sibling of KAFKA-14793 which propagates topic ids in the group 
coordinator on the offset commit (write) path. The purpose of this JIRA is to 
change the interfaces of the group coordinator and group coordinator adapter to 
propagate topic ids in a similar way.

KAFKA-14691 will add the topic ids to the OffsetFetch API itself so that topic 
ids are propagated from clients to the coordinator on the offset fetch path. 

Changes to the persisted data model (group metadata and keys) are out of scope.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15038) Use topic id/name mapping from the Metadata cache in RLM

2023-05-30 Thread Alexandre Dupriez (Jira)
Alexandre Dupriez created KAFKA-15038:
-

 Summary: Use topic id/name mapping from the Metadata cache in RLM
 Key: KAFKA-15038
 URL: https://issues.apache.org/jira/browse/KAFKA-15038
 Project: Kafka
  Issue Type: Sub-task
Reporter: Alexandre Dupriez
Assignee: Alexandre Dupriez


Currently, the {{RemoteLogManager}} maintains its own cache of topic name to 
topic id 
[[1]|https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L138],
 using the information provided during leadership changes and removing the 
mapping upon receiving the notification that a partition has stopped.

It should be possible to re-use the mapping in a broker's metadata cache, 
removing the need for the RLM to build and update a local cache that 
duplicates the information in the metadata cache. It would also preserve a 
single source of authority regarding the association between topic names and 
ids.

[1] 
https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L138



--
This message was sent by Atlassian Jira
(v8.20.10#820010)