[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15490: -- Affects Version/s: 3.4.1 > Invalid path provided to the log failure channel upon I/O error when writing > broker metadata checkpoint > --- > > Key: KAFKA-15490 > URL: https://issues.apache.org/jira/browse/KAFKA-15490 > Project: Kafka > Issue Type: Bug > Components: core > Affects Versions: 3.4.0, 3.4.1, 3.5.1 > Reporter: Alexandre Dupriez > Priority: Minor > > There is a small bug/typo in the handling of I/O errors when writing the broker > metadata checkpoint in {{KafkaServer}}. The path provided to the log dir > failure channel is the full path of the checkpoint file, whereas only the log > directory is expected > ([source|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/server/KafkaServer.scala#L958C8-L961C8]). > {code:java} > case e: IOException => >   val dirPath = checkpoint.file.getAbsolutePath >   logDirFailureChannel.maybeAddOfflineLogDir(dirPath, s"Error while writing > meta.properties to $dirPath", e){code} > As a result, after an {{IOException}} is captured and enqueued in the log dir > failure channel ({{}} is to be replaced with the actual path of > the log directory): > {code:java} > [2023-09-22 17:07:32,052] ERROR Error while writing meta.properties to > /meta.properties (kafka.server.LogDirFailureChannel) > java.io.IOException{code} > The log dir failure handler cannot look up the log directory: > {code:java} > [2023-09-22 17:07:32,053] ERROR [LogDirFailureHandler]: Error due to > (kafka.server.ReplicaManager$LogDirFailureHandler) > org.apache.kafka.common.errors.LogDirNotFoundException: Log dir > /meta.properties is not found in the config.{code} > An immediate fix for this is to use the {{logDir}} provided to the > checkpointing method instead of the path of the metadata file. 
> For brokers with only one log directory, this bug prevents > the broker from shutting down as expected. > The {{LogDirNotFoundException}} then kills the log dir failure handler > thread, subsequent {{IOException}}s are not handled, and the broker never > stops. > {code:java} > [2024-02-27 02:13:13,564] INFO [LogDirFailureHandler]: Stopped > (kafka.server.ReplicaManager$LogDirFailureHandler){code} > Another consideration here is whether the {{LogDirNotFoundException}} should > terminate the log dir failure handler thread. -- This message was sent by Atlassian Jira (v8.20.10#820010)
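The immediate fix described in the report can be sketched as below. This is an illustrative rewrite of the quoted handler, assuming the enclosing checkpointing method receives the log directory as {{logDir}} (as the report states); it is not the committed patch:

{code:java}
case e: IOException =>
  // Report the log directory itself rather than checkpoint.file.getAbsolutePath,
  // so the offline entry matches a directory listed in log.dirs and the
  // LogDirFailureHandler thread can resolve it instead of throwing
  // LogDirNotFoundException.
  logDirFailureChannel.maybeAddOfflineLogDir(logDir,
    s"Error while writing meta.properties to $logDir", e)
{code}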
[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15490: -- Affects Version/s: 3.4.0 […]
[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15490: -- Description: […]
[jira] [Assigned] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez reassigned KAFKA-15490: - Assignee: (was: Alexandre Dupriez) […]
[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15490: -- Description: […]
[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15490: -- Description: […]
[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15490: -- Description: […]
[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15490: -- Description: […]
[jira] [Comment Edited] (KAFKA-15609) Corrupted index uploaded to remote tier
[ https://issues.apache.org/jira/browse/KAFKA-15609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782229#comment-17782229 ] Alexandre Dupriez edited comment on KAFKA-15609 at 11/2/23 5:04 PM: The nature - private or shared - of a memory mapping has visibility implications between processes, but within the same process read-after-write consistency should always be guaranteed. "Flushing" a memory-mapped file to the block device can be initiated with the {{msync}} syscall, but that operation is not necessary for the visibility guarantees questioned in this ticket. A succinct description of memory mapping can be found in {_}Understanding the Linux Kernel, Third Edition{_}, O'Reilly, pages 657-668.
> Corrupted index uploaded to remote tier
> ---
>
> Key: KAFKA-15609
> URL: https://issues.apache.org/jira/browse/KAFKA-15609
> Project: Kafka
> Issue Type: Bug
> Components: Tiered-Storage
>Affects Versions: 3.6.0
>Reporter: Divij Vaidya
>Priority: Minor
>
> While testing Tiered Storage, we have observed corrupt indexes being present in the remote tier. One such situation is covered at https://issues.apache.org/jira/browse/KAFKA-15401. This Jira presents another possible case of corruption.
> Potential cause of index corruption:
> We want to ensure that the file we pass to the RSM plugin contains all the data present in the MemoryByteBuffer, i.e. we should have flushed the MemoryByteBuffer to the file using force(). In Kafka, when we close a segment, indexes are flushed asynchronously [1]. Hence, when we pass the file to the RSM, the file may not yet contain the flushed data, and we may end up uploading indexes which haven't been flushed. Ideally, the contract should enforce that we force-flush the content of the MemoryByteBuffer before handing the file to the RSM. This would ensure that indexes are not corrupted/incomplete.
> [1] https://github.com/apache/kafka/blob/4150595b0a2e0f45f2827cebc60bcb6f6558745d/core/src/main/scala/kafka/log/UnifiedLog.scala#L1613
--
This message was sent by Atlassian Jira (v8.20.10#820010)
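The contract suggested in the ticket can be illustrated with a small self-contained sketch (not Kafka code; the file name and sizes are illustrative). {{MappedByteBuffer.force()}} is the JVM-level counterpart of {{msync}}: it is about durability of the on-disk file handed to an external reader, not about visibility within the writing process.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ForceBeforeUpload {
    public static void main(String[] args) throws IOException {
        Path index = Files.createTempFile("offset-index", ".idx");
        try (FileChannel ch = FileChannel.open(index,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping grows the backing file to 16 bytes and exposes it
            // for writes through the page cache.
            MappedByteBuffer mmap = ch.map(FileChannel.MapMode.READ_WRITE, 0, 16);
            mmap.putLong(0, 42L); // write an index entry through the mapping

            // Within this process the write is already visible via the
            // unified page cache; force() flushes the dirty pages to the
            // device so the on-disk file is complete before being handed
            // to an external copier such as an RSM plugin.
            mmap.force();
        }
        System.out.println(Files.size(index)); // 16
        Files.deleteIfExists(index);
    }
}
```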
[jira] [Updated] (KAFKA-15678) [Tiered Storage] Stall remote reads with long-spanning transactions
[ https://issues.apache.org/jira/browse/KAFKA-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15678: -- Description: I am facing an issue on the remote data path for uncommitted reads. As mentioned in [the original PR|https://github.com/apache/kafka/pull/13535#discussion_r1166887367], if a transaction spans a long sequence of segments, the time taken to retrieve the producer snapshots from the remote storage can, in the worst case, become prohibitive and block the reads if it consistently exceeds the fetch request deadline ({{fetch.max.wait.ms}}).
Essentially, the method used to compute the uncommitted records to return with the fetch response has an asymptotic complexity proportional to the number of segments in the log. This is not a problem with local storage, since the constant factor to traverse the producer snapshot files is small enough, but it is with remote storage, which exhibits higher read latency.
An aggravating factor was the lock contention in the remote index cache, which has since been mitigated by KAFKA-15084. Unfortunately, despite the improvements observed without that contention, the algorithmic complexity of the current method can always defeat any optimisation made on the remote read path.
Maybe we could start thinking (if not already) about a different construct which would reduce that complexity to O(1), i.e. make the computation independent of the number of segments and of the spans of transactions.
> [Tiered Storage] Stall remote reads with long-spanning transactions
> ---
>
> Key: KAFKA-15678
> URL: https://issues.apache.org/jira/browse/KAFKA-15678
> Project: Kafka
> Issue Type: Bug
> Components: Tiered-Storage
>Affects Versions: 3.6.0
>Reporter: Alexandre Dupriez
>Priority: Major
> Labels: KIP-405
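The asymptotic argument above can be made concrete with a toy model (not Kafka's implementation; the lookup helper is hypothetical): resolving uncommitted records by consulting one per-segment producer snapshot issues one remote read per segment, so a transaction spanning n segments costs n remote reads before the fetch can complete.

```java
public class UncommittedScanSketch {
    // Hypothetical per-segment lookup: returns -1 while the transaction
    // spanning this segment is still open (no decided offset yet).
    static long firstUndecidedOffset(int segment) {
        return -1L;
    }

    public static void main(String[] args) {
        int segments = 500;   // a transaction spanning 500 tiered segments
        int remoteReads = 0;
        long lastStableOffset = -1L;

        // O(n) walk: one remote producer-snapshot fetch per segment. With
        // remote read latency, n reads can easily exceed fetch.max.wait.ms,
        // whereas a log-level summary of open transactions would cost O(1).
        for (int s = 0; s < segments && lastStableOffset < 0; s++) {
            remoteReads++; // each snapshot lookup is a remote storage read
            lastStableOffset = firstUndecidedOffset(s);
        }
        System.out.println(remoteReads); // 500
    }
}
```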
[jira] [Updated] (KAFKA-15678) [Tiered Storage] Stall remote reads with long-spanning transactions
[ https://issues.apache.org/jira/browse/KAFKA-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15678: -- Description:
I am facing an issue on the remote data path for uncommitted reads. As mentioned in [the original PR|https://github.com/apache/kafka/pull/13535#discussion_r1166887367], if a transaction spans a long sequence of segments, the time taken to retrieve the producer snapshots from the remote storage can, in the worst case, become prohibitive and block the reads if it consistently exceeds the deadline of fetch requests ({{fetch.max.wait.ms}}).
Essentially, the method used to compute the uncommitted records to return with the fetch response has an asymptotic complexity proportional to the number of segments in the log. This is not a problem with local storage, since the constant factor to traverse the files is small enough, but that is not the case with remote storage, which exhibits higher read latency. An aggravating factor was the lock contention in the remote index cache, which was then mitigated by KAFKA-15084. Unfortunately, despite the improvements observed without that contention, the algorithmic complexity of the current method used to compute uncommitted records can always defeat any optimisation made on the remote read path.
Maybe we could start thinking (if not already) about a different construct which would reduce that complexity to O(1) - i.e. make the computation independent of the number of segments and irrespective of the spans of transactions.

was:
I am facing an issue on the remote data path for uncommitted reads. As mentioned in [the original PR|https://github.com/apache/kafka/pull/13535#discussion_r1166887367], if a transaction spans a long sequence of segments, the time taken to retrieve the producer snapshots from the remote storage can, in the worst case, become prohibitive and block the reads if it consistently exceeds the deadline of fetch requests ({{fetch.max.wait.ms}}). Essentially, the method used to compute the uncommitted records to return has an asymptotic complexity proportional to the number of segments in the log. That is not a problem with local storage, since the constant factor to traverse the files is small enough, but that is not the case with remote storage, which exhibits higher read latency. An aggravating factor was the lock contention in the remote index cache, which was then mitigated by KAFKA-15084. Unfortunately, despite the improvements observed without that contention, the algorithmic complexity of the current method used to compute uncommitted records can always defeat any optimisation made on the remote read path. Maybe we could start thinking (if not already) about a different construct which would reduce that complexity to O(1) - i.e. make the computation independent of the number of segments and irrespective of the spans of transactions.

> [Tiered Storage] Stall remote reads with long-spanning transactions
> ---
>
> Key: KAFKA-15678
> URL: https://issues.apache.org/jira/browse/KAFKA-15678
> Project: Kafka
> Issue Type: Bug
> Components: Tiered-Storage
> Affects Versions: 3.6.0
> Reporter: Alexandre Dupriez
> Priority: Major
> Labels: KIP-405
>
> I am facing an issue on the remote data path for uncommitted reads.
> As mentioned in [the original PR|https://github.com/apache/kafka/pull/13535#discussion_r1166887367], if a transaction spans a long sequence of segments, the time taken to retrieve the producer snapshots from the remote storage can, in the worst case, become prohibitive and block the reads if it consistently exceeds the deadline of fetch requests ({{fetch.max.wait.ms}}).
> Essentially, the method used to compute the uncommitted records to return with the fetch response has an asymptotic complexity proportional to the number of segments in the log. This is not a problem with local storage, since the constant factor to traverse the files is small enough, but that is not the case with remote storage, which exhibits higher read latency. An aggravating factor was the lock contention in the remote index cache, which was then mitigated by KAFKA-15084. Unfortunately, despite the improvements observed without that contention, the algorithmic complexity of the current method used to compute uncommitted records can always defeat any optimisation made on the remote read path.
> Maybe we could start thinking (if not already) about a different construct which would reduce that complexity to O(1) - i.e. make the computation independent of the number of segments and irrespective of the spans of transactions.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
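The linear cost described in this report can be sketched with a toy model. Everything below is an illustrative assumption rather than Kafka's actual code: the class and method names are invented, the 50 ms remote read latency is a placeholder, and only the 500 ms figure corresponds to the default of {{fetch.max.wait.ms}}. The point is simply that one producer-snapshot fetch per spanned segment makes the total latency grow linearly with the segment count:

```java
import java.util.List;
import java.util.stream.LongStream;

// Hypothetical sketch (names are illustrative, not Kafka's actual API).
// Resolving uncommitted records for a fetch walks every segment an open
// transaction spans and pays one remote producer-snapshot read per segment,
// so the cost is O(N) in the number of spanned segments.
public class UncommittedReadCost {

    static final long REMOTE_FETCH_LATENCY_MS = 50;  // assumed remote read latency
    static final long FETCH_MAX_WAIT_MS = 500;       // fetch.max.wait.ms default

    // Total cost of the per-segment scan: one remote snapshot fetch per segment.
    static long perSegmentScanCostMs(int segmentsSpanned) {
        return LongStream.range(0, segmentsSpanned)
                .map(i -> REMOTE_FETCH_LATENCY_MS)   // snapshot fetch for segment i
                .sum();
    }

    public static void main(String[] args) {
        for (int n : List.of(1, 10, 100)) {
            long cost = perSegmentScanCostMs(n);
            System.out.println(n + " segments -> " + cost + " ms (budget "
                    + FETCH_MAX_WAIT_MS + " ms, exceeded=" + (cost > FETCH_MAX_WAIT_MS) + ")");
        }
    }
}
```

Under these assumed numbers, a transaction spanning 100 segments already needs around 5 s of snapshot fetches against a 500 ms fetch budget, which is why an O(1) construct, independent of the segment count, is attractive.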
[jira] [Updated] (KAFKA-15301) [Tiered Storage] Historically compacted topics send request to remote for active segment during consume
[ https://issues.apache.org/jira/browse/KAFKA-15301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15301: -- Description:
I have a use case where the tiered storage plugin received requests for active segments. The topics for which it happened were historically compacted topics for which compaction was disabled and tiering was enabled.
Create topic with compact cleanup policy -> produce data with a few repeated keys and create multiple segments -> let compaction happen -> change cleanup policy to delete -> produce some more data for segment rollover -> enable tiering on topic -> wait for segments to be uploaded to remote storage and cleaned up locally (the active segment would remain) -> consume from beginning -> observe logs.

was:
In AWS MSK (Kafka 2.8) tiered storage, a case surfaced where the tiered storage plugin received requests for active segments. The topics for which it happened were historically compacted topics for which compaction was disabled and tiering was enabled. Create topic with compact cleanup policy -> produce data with a few repeated keys and create multiple segments -> let compaction happen -> change cleanup policy to delete -> produce some more data for segment rollover -> enable tiering on topic -> wait for segments to be uploaded to remote storage and cleaned up locally (the active segment would remain) -> consume from beginning -> observe logs.

> [Tiered Storage] Historically compacted topics send request to remote for active segment during consume
> ---
>
> Key: KAFKA-15301
> URL: https://issues.apache.org/jira/browse/KAFKA-15301
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 3.6.0
> Reporter: Mital Awachat
> Assignee: Jimmy Wang
> Priority: Major
> Fix For: 3.7.0
>
> I have a use case where the tiered storage plugin received requests for active segments. The topics for which it happened were historically compacted topics for which compaction was disabled and tiering was enabled.
> Create topic with compact cleanup policy -> produce data with a few repeated keys and create multiple segments -> let compaction happen -> change cleanup policy to delete -> produce some more data for segment rollover -> enable tiering on topic -> wait for segments to be uploaded to remote storage and cleaned up locally (the active segment would remain) -> consume from beginning -> observe logs.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15678) [Tiered Storage] Stall remote reads with long-spanning transactions
[ https://issues.apache.org/jira/browse/KAFKA-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15678: -- Labels: KIP-405 (was: )
> [Tiered Storage] Stall remote reads with long-spanning transactions
> ---
>
> Key: KAFKA-15678
> URL: https://issues.apache.org/jira/browse/KAFKA-15678
> Project: Kafka
> Issue Type: Bug
> Components: Tiered-Storage
> Affects Versions: 3.6.0
> Reporter: Alexandre Dupriez
> Priority: Major
> Labels: KIP-405
>
> I am facing an issue on the remote data path for uncommitted reads.
> As mentioned in [the original PR|https://github.com/apache/kafka/pull/13535#discussion_r1166887367], if a transaction spans a long sequence of segments, the time taken to retrieve the producer snapshots from the remote storage can, in the worst case, become prohibitive and block the reads if it consistently exceeds the deadline of fetch requests ({{fetch.max.wait.ms}}).
> Essentially, the method used to compute the uncommitted records to return has an asymptotic complexity proportional to the number of segments in the log. That is not a problem with local storage, since the constant factor to traverse the files is small enough, but that is not the case with remote storage, which exhibits higher read latency. An aggravating factor was the lock contention in the remote index cache, which was then mitigated by KAFKA-15084. Unfortunately, despite the improvements observed without that contention, the algorithmic complexity of the current method used to compute uncommitted records can always defeat any optimisation made on the remote read path.
> Maybe we could start thinking (if not already) about a different construct which would reduce that complexity to O(1) - i.e. make the computation independent of the number of segments and irrespective of the transaction spans.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15678) [Tiered Storage] Stall remote reads with long-spanning transactions
Alexandre Dupriez created KAFKA-15678: - Summary: [Tiered Storage] Stall remote reads with long-spanning transactions Key: KAFKA-15678 URL: https://issues.apache.org/jira/browse/KAFKA-15678 Project: Kafka Issue Type: Bug Components: Tiered-Storage Affects Versions: 3.6.0 Reporter: Alexandre Dupriez
Hi team, I am facing an issue on the remote data path for uncommitted reads. As mentioned in [the original PR|https://github.com/apache/kafka/pull/13535#discussion_r1166887367], if a transaction spans a long sequence of segments, the time taken to retrieve the producer snapshots from the remote storage can, in the worst case, become prohibitive and block the reads if it consistently exceeds the deadline of fetch requests ({{fetch.max.wait.ms}}). Essentially, the method used to compute the uncommitted records to return has an asymptotic complexity proportional to the number of segments in the log. That is not a problem with local storage, since the constant factor to traverse the files is small enough, but that is not the case with remote storage, which exhibits higher read latency. An aggravating factor was the lock contention in the remote index cache, which was then mitigated by KAFKA-15084. Unfortunately, despite the improvements observed without that contention, the algorithmic complexity of the current method used to compute uncommitted records can always defeat any optimisation made on the remote read path. Maybe we could start thinking (if not already) about a different construct which would reduce that complexity to O(1) - i.e. make the computation independent of the number of segments and irrespective of the transaction spans.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-14482) Move LogLoader to storage module
[ https://issues.apache.org/jira/browse/KAFKA-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774641#comment-17774641 ] Alexandre Dupriez edited comment on KAFKA-14482 at 10/12/23 5:56 PM: - [~ijuma] I am really sorry - I wish I had time to work on this and contribute but am already totally swamped (fortunately this time, only figuratively) :( 😭 was (Author: adupriez): [~ijuma] I am really sorry - I wish I had time to contribute but am already swamped (fortunately this time, only figuratively) :( 😭 > Move LogLoader to storage module > > > Key: KAFKA-14482 > URL: https://issues.apache.org/jira/browse/KAFKA-14482 > Project: Kafka > Issue Type: Sub-task >Reporter: Ismael Juma >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14482) Move LogLoader to storage module
[ https://issues.apache.org/jira/browse/KAFKA-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774641#comment-17774641 ] Alexandre Dupriez commented on KAFKA-14482: --- [~ijuma] I am really sorry - I wish I had time to contribute but am already swamped (fortunately this time, only figuratively) :( 😭 > Move LogLoader to storage module > > > Key: KAFKA-14482 > URL: https://issues.apache.org/jira/browse/KAFKA-14482 > Project: Kafka > Issue Type: Sub-task >Reporter: Ismael Juma >Assignee: Alexandre Dupriez >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-14482) Move LogLoader to storage module
[ https://issues.apache.org/jira/browse/KAFKA-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez reassigned KAFKA-14482: - Assignee: (was: Alexandre Dupriez) > Move LogLoader to storage module > > > Key: KAFKA-14482 > URL: https://issues.apache.org/jira/browse/KAFKA-14482 > Project: Kafka > Issue Type: Sub-task >Reporter: Ismael Juma >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15490: -- Description:
There is a small bug/typo in the handling of I/O errors when writing the broker metadata checkpoint in {{KafkaServer}}. The path provided to the log dir failure channel is the full path of the checkpoint file, whereas only the log directory is expected ([source|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/server/KafkaServer.scala#L958C8-L961C8]).
{code:java}
case e: IOException =>
  val dirPath = checkpoint.file.getAbsolutePath
  logDirFailureChannel.maybeAddOfflineLogDir(dirPath, s"Error while writing meta.properties to $dirPath", e){code}
As a result, after an {{IOException}} is captured and enqueued in the log dir failure channel ({{<log-dir>}} is to be replaced with the actual path of the log directory):
{code:java}
[2023-09-22 17:07:32,052] ERROR Error while writing meta.properties to <log-dir>/meta.properties (kafka.server.LogDirFailureChannel)
java.io.IOException{code}
The log dir failure handler cannot look up the log directory:
{code:java}
[2023-09-22 17:07:32,053] ERROR [LogDirFailureHandler]: Error due to (kafka.server.ReplicaManager$LogDirFailureHandler)
org.apache.kafka.common.errors.LogDirNotFoundException: Log dir <log-dir>/meta.properties is not found in the config.{code}
An immediate fix is to use the {{logDir}} provided to the checkpointing method instead of the path of the metadata file.

was:
There is a small bug/typo in the handling of I/O errors when writing the broker metadata checkpoint in {{KafkaServer}}. The path provided to the log dir failure channel is the full path of the checkpoint file, whereas only the log directory is expected ([source|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/server/KafkaServer.scala#L958C8-L961C8]).
{code:java}
case e: IOException =>
  val dirPath = checkpoint.file.getAbsolutePath
  logDirFailureChannel.maybeAddOfflineLogDir(dirPath, s"Error while writing meta.properties to $dirPath", e){code}
As a result, after an {{IOException}} is captured and enqueued in the log dir failure channel ({{<log-dir>}} is to be replaced with the actual path of the log directory):
{code:java}
[2023-09-22 17:07:32,052] ERROR Error while writing meta.properties to <log-dir>/meta.properties (kafka.server.LogDirFailureChannel)
java.io.IOException{code}
The log dir failure handler cannot look up the log directory:
{code:java}
[2023-09-22 17:07:32,053] ERROR [LogDirFailureHandler]: Error due to (kafka.server.ReplicaManager$LogDirFailureHandler)
org.apache.kafka.common.errors.LogDirNotFoundException: Log dir <log-dir>/meta.properties is not found in the config.{code}
An immediate fix is to use the {{logDir}} provided to the checkpointing method instead of the path of the metadata file.

> Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
> ---
>
> Key: KAFKA-15490
> URL: https://issues.apache.org/jira/browse/KAFKA-15490
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 3.5.1
> Reporter: Alexandre Dupriez
> Assignee: Alexandre Dupriez
> Priority: Minor
>
> There is a small bug/typo in the handling of I/O errors when writing the broker metadata checkpoint in {{KafkaServer}}. The path provided to the log dir failure channel is the full path of the checkpoint file, whereas only the log directory is expected ([source|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/server/KafkaServer.scala#L958C8-L961C8]).
> {code:java}
> case e: IOException =>
>   val dirPath = checkpoint.file.getAbsolutePath
>   logDirFailureChannel.maybeAddOfflineLogDir(dirPath, s"Error while writing meta.properties to $dirPath", e){code}
> As a result, after an {{IOException}} is captured and enqueued in the log dir failure channel ({{<log-dir>}} is to be replaced with the actual path of the log directory):
> {code:java}
> [2023-09-22 17:07:32,052] ERROR Error while writing meta.properties to <log-dir>/meta.properties (kafka.server.LogDirFailureChannel)
> java.io.IOException{code}
> The log dir failure handler cannot look up the log directory:
> {code:java}
> [2023-09-22 17:07:32,053] ERROR [LogDirFailureHandler]: Error due to (kafka.server.ReplicaManager$LogDirFailureHandler)
> org.apache.kafka.common.errors.LogDirNotFoundException: Log dir <log-dir>/meta.properties is not found in the config.{code}
> An immediate fix is to use the {{logDir}} provided to the checkpointing method instead of the path of the metadata file.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
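As a minimal illustration of the mix-up (the paths and class name below are made up for the example; the ticket's suggested fix is simply to pass the {{logDir}} already provided to the checkpointing method, not to derive it), the value handed to the failure channel should be the enclosing log directory, not the checkpoint file itself:

```java
import java.io.File;

// Illustrative sketch of the KAFKA-15490 path mix-up. The broker registers an
// offline log directory keyed by the log dir path; reporting the checkpoint
// file's own path means the later lookup against the configured log.dirs
// entries can never match, yielding LogDirNotFoundException.
public class OfflineDirPathFix {

    // What the buggy code passes to maybeAddOfflineLogDir: the file itself.
    static String buggyPath(File checkpointFile) {
        return checkpointFile.getAbsolutePath();             // e.g. /var/kafka-logs/meta.properties
    }

    // The intended value: the enclosing log directory.
    static String fixedPath(File checkpointFile) {
        return checkpointFile.getAbsoluteFile().getParent(); // e.g. /var/kafka-logs
    }

    public static void main(String[] args) {
        File checkpoint = new File("/var/kafka-logs/meta.properties");
        String configuredLogDir = "/var/kafka-logs";         // hypothetical log.dirs entry
        System.out.println("buggy: " + buggyPath(checkpoint));
        System.out.println("fixed: " + fixedPath(checkpoint));
        // Only the directory form can match the configured log directory.
        System.out.println(configuredLogDir.equals(fixedPath(checkpoint)));
    }
}
```

The offline-dir lookup compares the reported path against the configured log directories, so the file path falls through and produces the {{LogDirNotFoundException}} shown in the logs above.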
[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15490: -- Description: There is a small bug/typo in the handling of an I/O error when writing the broker metadata checkpoint in {{KafkaServer}}. The path provided to the log dir failure channel is the full path of the checkpoint file, whereas only the log directory is expected ([source|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/server/KafkaServer.scala#L958C8-L961C8]).
{code:java}
case e: IOException =>
  val dirPath = checkpoint.file.getAbsolutePath
  logDirFailureChannel.maybeAddOfflineLogDir(dirPath, s"Error while writing meta.properties to $dirPath", e){code}
As a result, after an {{IOException}} is captured and enqueued in the log dir failure channel ({{<log-dir>}} is to be replaced with the actual path of the log directory):
{code:java}
[2023-09-22 17:07:32,052] ERROR Error while writing meta.properties to <log-dir>/meta.properties (kafka.server.LogDirFailureChannel)
java.io.IOException{code}
the log dir failure handler cannot look up the log directory:
{code:java}
[2023-09-22 17:07:32,053] ERROR [LogDirFailureHandler]: Error due to (kafka.server.ReplicaManager$LogDirFailureHandler)
org.apache.kafka.common.errors.LogDirNotFoundException: Log dir <log-dir>/meta.properties is not found in the config.{code}
An immediate fix is to use the {{logDir}} provided to the checkpointing method instead of the path of the metadata file.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
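The shape of the bug can be illustrated outside of Kafka: given the checkpoint file, the value handed to the failure channel must be the containing log directory, not the file's own path. A minimal Java sketch (the class and method names are hypothetical; the actual patch simply passes the {{logDir}} argument already available in the checkpointing method):

```java
import java.io.File;

public class CheckpointPathDemo {
    // Hypothetical stand-in for the value enqueued in the log dir failure channel.
    static String offlineDirFor(File checkpointFile) {
        // Buggy variant: checkpointFile.getAbsolutePath() yields
        // "<log-dir>/meta.properties", which no configured log dir can match.
        // Correct variant: report the containing directory instead.
        return checkpointFile.getParentFile().getAbsolutePath();
    }

    public static void main(String[] args) {
        File checkpoint = new File("/var/kafka-logs/meta.properties");
        // Prints the parent directory, e.g. /var/kafka-logs on a Unix path.
        System.out.println(offlineDirFor(checkpoint));
    }
}
```

The failure-channel lookup matches entries against the configured {{log.dirs}} values, which is why only the directory path can ever resolve.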
[jira] [Assigned] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez reassigned KAFKA-15490: - Assignee: Alexandre Dupriez
[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15490: -- Affects Version/s: 3.5.1
[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15490: -- Description: (formatting of the code blocks in the issue description revised)
[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15490: -- Description: (formatting of the code blocks in the issue description revised)
[jira] [Created] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
Alexandre Dupriez created KAFKA-15490: - Summary: Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint Key: KAFKA-15490 URL: https://issues.apache.org/jira/browse/KAFKA-15490 Project: Kafka Issue Type: Bug Components: core Reporter: Alexandre Dupriez
[jira] [Updated] (KAFKA-15486) Include NIO exceptions as I/O exceptions to be part of the disk failure handling mechanism
[ https://issues.apache.org/jira/browse/KAFKA-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15486: -- Description: Currently, Apache Kafka detects and captures I/O errors when accessing the file system via the standard {{IOException}} from the JDK. There are cases, however, where I/O errors are reported only via exceptions such as {{BufferOverflowException}}, with no associated {{IOException}} on the produce or read path, so the data volume is not detected as unhealthy and is not added to the list of offline directories.
Specifically, we faced the following scenario on a broker:
* The data volume hosting a log directory became saturated.
* As expected, {{IOException}}s were generated on the read/write path.
* The log directory was set as offline, and since it was the only log directory configured on the broker, Kafka automatically shut down.
* Additional space was added to the data volume.
* Kafka was then restarted.
* No more {{IOException}}s occurred; however, {{BufferOverflowException}}s *[*]* were raised while trying to delete log segments in order to honour the retention settings of a topic. The log directory was not moved offline and the exceptions kept re-occurring indefinitely.
The retention settings were therefore not applied in this case. The mitigation consisted of restarting Kafka.
It may be worth treating {{BufferOverflowException}} and {{BufferUnderflowException}} (and any other JDK NIO exception which surfaces an I/O error) like the current {{IOException}}, as a proxy for storage I/O failure. There may be known unintended consequences of doing so, which would explain why these exceptions were not added already; alternatively, the impact may be too marginal to justify exposing the main I/O failure handling path to unknown side effects.
*[*]*
{code:java}
java.nio.BufferOverflowException
    at java.base/java.nio.Buffer.nextPutIndex(Buffer.java:674)
    at java.base/java.nio.DirectByteBuffer.putLong(DirectByteBuffer.java:882)
    at kafka.log.TimeIndex.$anonfun$maybeAppend$1(TimeIndex.scala:134)
    at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:114)
    at kafka.log.LogSegment.onBecomeInactiveSegment(LogSegment.scala:506)
    at kafka.log.Log.$anonfun$roll$8(Log.scala:2066)
    at kafka.log.Log.$anonfun$roll$8$adapted(Log.scala:2066)
    at scala.Option.foreach(Option.scala:437)
    at kafka.log.Log.$anonfun$roll$2(Log.scala:2066)
    at kafka.log.Log.roll(Log.scala:2482)
    at kafka.log.Log.maybeRoll(Log.scala:2017)
    at kafka.log.Log.append(Log.scala:1292)
    at kafka.log.Log.appendAsFollower(Log.scala:1155)
    at kafka.cluster.Partition.doAppendRecordsToFollowerOrFutureReplica(Partition.scala:1023)
    at kafka.cluster.Partition.appendRecordsToFollowerOrFutureReplica(Partition.scala:1030)
    at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:178)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7(AbstractFetcherThread.scala:356)
    at scala.Option.foreach(Option.scala:437)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6(AbstractFetcherThread.scala:345)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6$adapted(AbstractFetcherThread.scala:344)
    at kafka.utils.Implicits$MapExtensionMethods$.$anonfun$forKeyValue$1(Implicits.scala:62)
    at scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry(JavaCollectionWrappers.scala:359)
    at scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry$(JavaCollectionWrappers.scala:355)
    at scala.collection.convert.JavaCollectionWrappers$AbstractJMapWrapper.foreachEntry(JavaCollectionWrappers.scala:309)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:344)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:141)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:140)
    at scala.Option.foreach(Option.scala:437)
    at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:140)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:123)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96){code}
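One way to realise the proposal above (a sketch under assumed names, not the eventual Kafka change) is to translate the NIO buffer exceptions into {{IOException}} at the storage boundary, so that the existing {{IOException}}-based failure handling sees them unchanged:

```java
import java.io.IOException;
import java.nio.BufferOverflowException;
import java.nio.BufferUnderflowException;

public class NioFailureDemo {
    // Hypothetical wrapper for any file-system-touching operation.
    interface StorageAction { void run(); }

    // Re-throw NIO buffer errors as IOException so they flow into the
    // existing I/O failure handling instead of escaping it.
    static void runAsIo(StorageAction action) throws IOException {
        try {
            action.run();
        } catch (BufferOverflowException | BufferUnderflowException e) {
            throw new IOException("NIO buffer error treated as storage I/O failure", e);
        }
    }

    public static void main(String[] args) {
        try {
            runAsIo(() -> { throw new BufferOverflowException(); });
        } catch (IOException e) {
            // The caller now observes an IOException whose cause is the buffer error.
            System.out.println("captured: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```

The trade-off noted in the description applies: such a blanket translation would also convert genuine programming errors (a mis-sized buffer) into offline-directory events, which is one plausible "unintended consequence" of widening the catch.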
[jira] [Updated] (KAFKA-15486) Include NIO exceptions as I/O exceptions to be part of the disk failure handling mechanism
[ https://issues.apache.org/jira/browse/KAFKA-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15486: -- Description: (formatting of the issue description revised)
[jira] [Updated] (KAFKA-15486) Include NIO exceptions as I/O exceptions to be part of the disk failure handling mechanism
[ https://issues.apache.org/jira/browse/KAFKA-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15486: -- Summary: Include NIO exceptions as I/O exceptions to be part of the disk failure handling mechanism (was: Include NIO exceptions as I/O exceptions to be part of disk failure handling)
[jira] [Created] (KAFKA-15486) Include NIO exceptions as I/O exceptions to be part of disk failure handling
Alexandre Dupriez created KAFKA-15486: - Summary: Include NIO exceptions as I/O exceptions to be part of disk failure handling Key: KAFKA-15486 URL: https://issues.apache.org/jira/browse/KAFKA-15486 Project: Kafka Issue Type: Improvement Components: core, jbod Reporter: Alexandre Dupriez Currently, Apache Kafka offers the ability to detect and capture I/O errors when accessing the file system via the standard {{IOException}} from the JDK. There are cases however, where I/O errors are only reported via exceptions such as {{{}BufferOverflowException{}}}, without associated {{IOException}} on the produce or read path, so that the data volume is not detected as unhealthy and not included in the list of offline directories. Specifically, we faced the following scenario on a broker: * The data volume hosting a log directory became saturated. * As expected, {{IOException}} were generated on the read/write path. * The log directory was set as offline and since it was the only log directory configured on the broker, Kafka automatically shut down. * Additional space was added to the data volume. * Kafka was then restarted. * No more {{IOException}} occurred, however {{BufferOverflowException}} *[*]* were raised while trying to delete log segments in oder to honour the retention settings of a topic. The log directory was not moved to offline and the exceptions kept re-occurring indefinitely. The retention settings were therefore not applied in this case. The mitigation consisted in restarting Kafka. 
It may be worth considering adding {{BufferOverflowException}} and {{BufferUnderflowException}} (and any other related exception from the JDK NIO library which surfaces an I/O error) to the current {{IOException}} as a proxy of storage I/O failure, although there may be known unintended consequences in doing so which is the reason they were not added already, or, it may be too marginal of an impact to modify the main I/O failure handing path to risk exposing it to such unknown unintended consequences. *[*]* {code:java} java.nio.BufferOverflowException at java.base/java.nio.Buffer.nextPutIndex(Buffer.java:674) at java.base/java.nio.DirectByteBuffer.putLong(DirectByteBuffer.java:882) at kafka.log.TimeIndex.$anonfun$maybeAppend$1(TimeIndex.scala:134) at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:114) at kafka.log.LogSegment.onBecomeInactiveSegment(LogSegment.scala:506) at kafka.log.Log.$anonfun$roll$8(Log.scala:2066) at kafka.log.Log.$anonfun$roll$8$adapted(Log.scala:2066) at scala.Option.foreach(Option.scala:437) at kafka.log.Log.$anonfun$roll$2(Log.scala:2066) at kafka.log.Log.roll(Log.scala:2482) at kafka.log.Log.maybeRoll(Log.scala:2017) at kafka.log.Log.append(Log.scala:1292) at kafka.log.Log.appendAsFollower(Log.scala:1155) at kafka.cluster.Partition.doAppendRecordsToFollowerOrFutureReplica(Partition.scala:1023) at kafka.cluster.Partition.appendRecordsToFollowerOrFutureReplica(Partition.scala:1030) at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:178) at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7(AbstractFetcherThread.scala:356) at scala.Option.foreach(Option.scala:437) at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6(AbstractFetcherThread.scala:345) at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6$adapted(AbstractFetcherThread.scala:344) at kafka.utils.Implicits$MapExtensionMethods$.$anonfun$forKeyValue$1(Implicits.scala:62) at 
scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry(JavaCollectionWrappers.scala:359) at scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry$(JavaCollectionWrappers.scala:355) at scala.collection.convert.JavaCollectionWrappers$AbstractJMapWrapper.foreachEntry(JavaCollectionWrappers.scala:309) at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:344) at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:141) at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:140) at scala.Option.foreach(Option.scala:437) at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:140) at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:123) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
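The classification proposed in KAFKA-15486 can be sketched as follows. This is a hypothetical illustration only: the class and method names are invented, and Kafka's actual disk-failure handling lives elsewhere; the sketch merely shows NIO buffer exceptions being treated as storage failures alongside {{IOException}}.

```java
import java.io.IOException;
import java.nio.BufferOverflowException;
import java.nio.BufferUnderflowException;

public class StorageFailureClassifier {
    // Hypothetical helper: decides whether an exception should be treated
    // as a storage I/O failure, so the affected log directory could be
    // marked offline. Today only IOException takes this path in Kafka;
    // the proposal widens the check to the JDK NIO buffer exceptions.
    static boolean isStorageFailure(Throwable t) {
        return t instanceof IOException
                || t instanceof BufferOverflowException
                || t instanceof BufferUnderflowException;
    }
}
```

In the incident described above, a {{BufferOverflowException}} raised from {{TimeIndex.maybeAppend}} would match this predicate and take the offline-directory path instead of recurring indefinitely.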
[jira] [Updated] (KAFKA-15038) Use topic id/name mapping from the Metadata cache in RLM
[ https://issues.apache.org/jira/browse/KAFKA-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15038: -- Component/s: core > Use topic id/name mapping from the Metadata cache in RLM > > > Key: KAFKA-15038 > URL: https://issues.apache.org/jira/browse/KAFKA-15038 > Project: Kafka > Issue Type: Sub-task > Components: core >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > > Currently, the {{RemoteLogManager}} maintains its own cache of topic name to > topic id > [[1]|https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L138] > using the information provided during leadership changes, and removing the > mapping upon receiving the notification of partition stopped. > It should be possible to re-use the mapping in a broker's metadata cache, > removing the need for the RLM to build and update a local cache which > duplicates the information in the metadata cache. It would also preserve > a single source of authority for the association between topic names > and ids. > [1] > https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L138 -- This message was sent by Atlassian Jira (v8.20.10#820010)
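The shape of the proposed change can be sketched as below. All names here ({{MetadataCacheView}}, {{topicId}}, the use of {{String}} ids) are invented for illustration and do not match Kafka's actual types; the point is only that the RLM queries the broker's metadata cache on demand instead of maintaining its own map on leadership-change and stop-partition notifications.

```java
import java.util.Optional;

// Hypothetical read-only view over the broker metadata cache.
interface MetadataCacheView {
    Optional<String> topicId(String topicName);
}

// Sketch of an RLM that delegates lookups instead of caching locally.
class RemoteLogManagerSketch {
    private final MetadataCacheView metadataCache;

    RemoteLogManagerSketch(MetadataCacheView metadataCache) {
        this.metadataCache = metadataCache;
    }

    // Single source of authority: nothing to populate on leadership
    // changes, nothing to invalidate when a partition is stopped.
    Optional<String> lookupTopicId(String topicName) {
        return metadataCache.topicId(topicName);
    }
}
```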
[jira] [Updated] (KAFKA-15038) Use topic id/name mapping from the Metadata cache in the RemoteLogManager
[ https://issues.apache.org/jira/browse/KAFKA-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-15038: -- Summary: Use topic id/name mapping from the Metadata cache in the RemoteLogManager (was: Use topic id/name mapping from the Metadata cache in RLM) > Use topic id/name mapping from the Metadata cache in the RemoteLogManager > - > > Key: KAFKA-15038 > URL: https://issues.apache.org/jira/browse/KAFKA-15038 > Project: Kafka > Issue Type: Sub-task > Components: core >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > > Currently, the {{RemoteLogManager}} maintains its own cache of topic name to > topic id > [[1]|https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L138] > using the information provided during leadership changes, and removing the > mapping upon receiving the notification of partition stopped. > It should be possible to re-use the mapping in a broker's metadata cache, > removing the need for the RLM to build and update a local cache thereby > duplicating the information in the metadata cache. It also allows to preserve > a single source of authority regarding the association between topic names > and ids. > [1] > https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L138 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15038) Use topic id/name mapping from the Metadata cache in RLM
Alexandre Dupriez created KAFKA-15038: - Summary: Use topic id/name mapping from the Metadata cache in RLM Key: KAFKA-15038 URL: https://issues.apache.org/jira/browse/KAFKA-15038 Project: Kafka Issue Type: Sub-task Reporter: Alexandre Dupriez Assignee: Alexandre Dupriez Currently, the {{RemoteLogManager}} maintains its own cache of topic name to topic id [[1]|https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L138] using the information provided during leadership changes, and removing the mapping upon receiving the notification of partition stopped. It should be possible to re-use the mapping in a broker's metadata cache, removing the need for the RLM to build and update a local cache thereby duplicating the information in the metadata cache. It also allows to preserve a single source of authority regarding the association between topic names and ids. [1] https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L138 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-7739) Kafka Tiered Storage
[ https://issues.apache.org/jira/browse/KAFKA-7739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721025#comment-17721025 ] Alexandre Dupriez commented on KAFKA-7739: -- Got it. Sorry I misunderstood your suggestion. Thanks for clarifying! (and agree with it). > Kafka Tiered Storage > > > Key: KAFKA-7739 > URL: https://issues.apache.org/jira/browse/KAFKA-7739 > Project: Kafka > Issue Type: New Feature > Components: core >Reporter: Harsha >Assignee: Satish Duggana >Priority: Major > Labels: needs-kip > Fix For: 3.6.0 > > > KIP: > [https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-7739) Kafka Tiered Storage
[ https://issues.apache.org/jira/browse/KAFKA-7739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720947#comment-17720947 ] Alexandre Dupriez commented on KAFKA-7739: -- I would think that the absence of the log segment or any mandatory ancillary file results in the stated exception. I think only transaction indexes are optional. > Kafka Tiered Storage > > > Key: KAFKA-7739 > URL: https://issues.apache.org/jira/browse/KAFKA-7739 > Project: Kafka > Issue Type: New Feature > Components: core >Reporter: Harsha >Assignee: Satish Duggana >Priority: Major > Labels: needs-kip > Fix For: 3.6.0 > > > KIP: > [https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-14928) Metrics collection contends on lock with log cleaning
[ https://issues.apache.org/jira/browse/KAFKA-14928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715041#comment-17715041 ] Alexandre Dupriez edited comment on KAFKA-14928 at 4/21/23 2:55 PM: Hi Divij, thanks for reporting this. Would you have a reproduction case which exhibits the contention? was (Author: adupriez): Hi Divij, thanks for reporting this. Would you have a reproduction case which demonstrates the contention? > Metrics collection contends on lock with log cleaning > - > > Key: KAFKA-14928 > URL: https://issues.apache.org/jira/browse/KAFKA-14928 > Project: Kafka > Issue Type: Bug >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > Fix For: 3.6.0 > > > In LogCleanerManager.scala, calculation of a metric requires a lock [1]. This > same lock is required by core log cleaner functionality such as > "grabFilthiestCompactedLog". This might lead to a situation where metric > calculation holding the lock for an extended period of time may affect the > core functionality of log cleaning. > This outcome of this task is to prevent expensive metric calculation from > blocking log cleaning/compaction activity. > [1] > https://github.com/apache/kafka/blob/dd63d88ac3ea7a9a55a6dacf9c5473e939322a55/core/src/main/scala/kafka/log/LogCleanerManager.scala#L102 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14928) Metrics collection contends on lock with log cleaning
[ https://issues.apache.org/jira/browse/KAFKA-14928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715041#comment-17715041 ] Alexandre Dupriez commented on KAFKA-14928: --- Hi Divij, thanks for reporting this. Would you have a reproduction case which demonstrates the contention? > Metrics collection contends on lock with log cleaning > - > > Key: KAFKA-14928 > URL: https://issues.apache.org/jira/browse/KAFKA-14928 > Project: Kafka > Issue Type: Bug >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > Fix For: 3.6.0 > > > In LogCleanerManager.scala, calculation of a metric requires a lock [1]. This > same lock is required by core log cleaner functionality such as > "grabFilthiestCompactedLog". This might lead to a situation where metric > calculation holding the lock for an extended period of time may affect the > core functionality of log cleaning. > This outcome of this task is to prevent expensive metric calculation from > blocking log cleaning/compaction activity. > [1] > https://github.com/apache/kafka/blob/dd63d88ac3ea7a9a55a6dacf9c5473e939322a55/core/src/main/scala/kafka/log/LogCleanerManager.scala#L102 -- This message was sent by Atlassian Jira (v8.20.10#820010)
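One common way to decouple metric reads from the lock described in KAFKA-14928 is to precompute the expensive value on the cleaning path and let the gauge read a cached copy lock-free. The sketch below is illustrative only: class and method names are invented and this is not Kafka's implementation, merely one possible shape of the fix.

```java
// Sketch: the cleaning path updates a cached value while it already holds
// the cleaner lock; the metrics reporter reads the volatile field without
// acquiring the lock, so it can never block grabFilthiestCompactedLog.
class CleanerMetricsSketch {
    private final Object cleanerLock = new Object();
    private volatile long cachedUncleanableBytes;

    // Called from log cleaning, which takes this lock anyway.
    void recomputeUnderLock(long expensiveResult) {
        synchronized (cleanerLock) {
            cachedUncleanableBytes = expensiveResult; // volatile write publishes the value
        }
    }

    // Called by the metrics gauge: lock-free read of the last computed value.
    long gaugeValue() {
        return cachedUncleanableBytes;
    }
}
```

The trade-off is staleness: the gauge reports the value as of the last cleaning pass rather than an instantaneous recomputation.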
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case bears similarity to the one fixed by KAFKA-6584 and is induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985, still open as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of these brokers was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session had expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of the "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. 
{code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. 
{code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. 
{code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. 
{code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. 
{code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. 
{code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. 
{code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. 
{code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. 
{code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. 
{code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:{code}
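The decision described above can be sketched as follows. This is a hypothetical simplification for illustration only, not the actual KafkaZkClient code; the method name {{shouldRecreate}} and the {{previousSession}} value are invented for this sketch.

```java
// Hypothetical sketch of the KAFKA-6584 error path: when the ephemeral
// creation fails with NodeExists, the znode is deleted and recreated only
// if it is owned by this broker's own previous (now expired) session.
public class EphemeralOwnerCheck {

    // Returns true when it is safe to delete and recreate /brokers/ids/<id>:
    // the existing node must belong to the broker's previous session, and
    // must not already belong to the broker's current session.
    static boolean shouldRecreate(long ownerSession, long currentSession, long previousSession) {
        return ownerSession != currentSession && ownerSession == previousSession;
    }

    public static void main(String[] args) {
        long owner = 216172783240153793L;    // session owning /brokers/ids/18, per the logs above
        long current = 72176813933264897L;   // broker's newly established session
        long previous = 72057594037927937L;  // hypothetical last session recorded by the broker

        // The owner matches neither the current session nor the recorded
        // previous session, so recreation is not attempted and the
        // NodeExistsException surfaces, as in the report above.
        System.out.println(shouldRecreate(owner, current, previous)); // prints "false"
    }
}
```

Under this reading, the broker is stuck because the ephemeral node's owner is a session it has lost track of, which is the gap this issue describes.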
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Attachment: (was: broker-registration.drawio (4).png)
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Attachment: phoque.png
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Attachment: broker-registration.drawio (4).png
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. 
{code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting for a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{0x1006c6e0b830001}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}}, which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node, {{0x300000043230ac1}} ({{216172783240153793}}), is different from the last active Zookeeper session which the broker has recorded. It is also different from the current session, {{0x1006c6e0b830001}} ({{72176813933264897}}), hence the re-creation of the broker znode is not attempted. 
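As an aside, the ClientCnxn logs report session ids in hexadecimal while the KafkaZkClient error reports them in decimal. A quick conversion (an illustrative snippet, not Kafka code) confirms both notations denote the same session:

{code:java}
public class SessionIdCorrelation {
    public static void main(String[] args) {
        // Session id as printed by org.apache.zookeeper.ClientCnxn (hexadecimal).
        long fromLog = Long.parseLong("1006c6e0b830001", 16);
        // Session id as printed in the KafkaZkClient error (decimal).
        long fromError = 72176813933264897L;
        // Both identifiers refer to the same ZooKeeper session.
        System.out.println(fromLog == fromError); // prints "true"
    }
}
{code}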
{code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:
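The decision not to re-create the znode follows from the ownership check introduced for KAFKA-6584. The sketch below is a deliberate simplification (the method name and shape are illustrative, not the actual KafkaZkClient code): re-creation is only attempted when the znode's ephemeral owner is a session the broker recognises.

{code:java}
import java.util.Set;

public class EphemeralOwnerCheck {
    // Simplified stand-in for the KAFKA-6584 error path: the broker only
    // deletes and re-creates /brokers/ids/<id> when the ephemeral owner
    // matches a session it knows about (current or last recorded).
    static boolean mayRecreate(long ephemeralOwner, Set<Long> knownSessions) {
        return knownSessions.contains(ephemeralOwner);
    }

    public static void main(String[] args) {
        long owner = 216172783240153793L;        // owner of /brokers/ids/18
        long currentSession = 72176813933264897L;
        // The owning session was created by a request whose response the
        // broker never processed, so it appears in neither set entry.
        System.out.println(mayRecreate(owner, Set.of(currentSession))); // prints "false"
    }
}
{code}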
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Summary: Broker ZNode creation can fail due to lost Zookeeper Session ID (was: Broker ZNode creation can fail due to a session ID unknown to the broker) > Broker ZNode creation can fail due to lost Zookeeper Session ID > --- > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: kafka-broker-reg.log 
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Attachment: kafka-broker-reg.log > Broker ZNode creation can fail due to a session ID unknown to the broker > > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: kafka-broker-reg.log 
[jira] [Comment Edited] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705998#comment-17705998 ] Alexandre Dupriez edited comment on KAFKA-14845 at 3/28/23 2:10 PM: The reproduction test case has been updated and is available [in github|https://github.com/Hangleton/kafka-tools/tree/master/kafka-broker-reg]. The logs of a run of this test have been attached to this ticket. It does not require any forced session renewal but reproduces the use case using: * Connection delay * Response drops * Session expiration delay was (Author: adupriez): The reproduction test case has been updated and is available [in github|https://github.com/Hangleton/kafka-tools/tree/master/kafka-broker-reg]. It does not require any forced session renewal but reproduces the use case using: * Connection delay * Response drops * Session expiration delay > Broker ZNode creation can fail due to a session ID unknown to the broker > > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: kafka-broker-reg.log 
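The reproduction levers listed in the comment above (connection delay, response drops, session expiration delay) can be sketched with a small, self-contained model. This is a hedged illustration, not the actual test harness from the linked repository: a dropped session-establishment response leaves the broker unaware of the server-side session that owns its ephemeral znode.

{code:java}
import java.util.HashMap;
import java.util.Map;

public class LostSessionModel {
    // znode path -> owning session id, as seen by the ZK ensemble
    static Map<String, Long> ephemeralOwners = new HashMap<>();

    // The server grants a session and creates the ephemeral znode, but the
    // response may be dropped before the broker processes it.
    static void serverSideRegister(String path, long sessionId) {
        ephemeralOwners.put(path, sessionId);
    }

    public static void main(String[] args) {
        // First attempt: response dropped, broker never records the session.
        serverSideRegister("/brokers/ids/18", 216172783240153793L);
        long brokerKnownSession = 0L; // sessionid remains 0x0 on the broker

        // Second attempt on a new connection: a fresh session is granted.
        long newSession = 72176813933264897L;
        long owner = ephemeralOwners.get("/brokers/ids/18");
        // The owner matches neither the broker's last recorded session nor
        // the new one, so the KAFKA-6584 recovery path declines to re-create.
        System.out.println(owner != brokerKnownSession && owner != newSession); // prints "true"
    }
}
{code}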
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Attachment: (was: broker-registration.drawio.png) > Broker ZNode creation can fail due to a session ID unknown to the broker > > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor 
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is similar to the one fixed by KAFKA-6584 and is induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985, still open as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of these brokers was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of a "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. 
{code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. 
{code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:
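The ownership check described above can be sketched as follows. This is an illustrative simplification of the decision made in {{KafkaZkClient.CheckedEphemeral.getAfterNodeExists}}, not Kafka's actual code; the session ids are taken from the logs above, while the `lastKnown` value and the helper's name and signature are assumed for illustration.

```java
// Illustrative simplification (not Kafka's actual code) of the KAFKA-6584
// error path: the pre-existing ephemeral znode is deleted and recreated only
// when its owner session is one the broker knows about.
final class EphemeralOwnerCheck {
    static boolean shouldRecreate(long ephemeralOwner, long currentSession, long lastKnownSession) {
        // An owner from a session the client never learned of matches neither
        // id, so registration fails with NodeExistsException instead.
        return ephemeralOwner == currentSession || ephemeralOwner == lastKnownSession;
    }

    public static void main(String[] args) {
        long owner = 216172783240153793L;   // session owning /brokers/ids/18, from the logs
        long current = 72176813933264897L;  // current session 0x1006c6e0b830001
        long lastKnown = 1L;                // hypothetical last session recorded by the broker
        System.out.println(shouldRecreate(owner, current, lastKnown)); // prints false
    }
}
```

Because the owning session matches neither the current nor the last recorded session, recreation is not attempted and the broker stays out of the cluster.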
[jira] [Commented] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705998#comment-17705998 ] Alexandre Dupriez commented on KAFKA-14845: --- The reproduction test case has been updated and is available [on GitHub|https://github.com/Hangleton/kafka-tools/tree/master/kafka-broker-reg]. It does not require any forced session renewal; it reproduces the use case using: * Connection delay * Response drops * Session expiration delay > Broker ZNode creation can fail due to a session ID unknown to the broker > > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: broker-registration.drawio.png > > > Our production environment faced a use case where registration of a broker > failed due to the presence of a "conflicting" broker znode in Zookeeper. This > case is not without familiarity to that fixed by KAFKA-6584 and induced by > the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. > A network partition disturbed communication channels between the Kafka and > Zookeeper clusters for about 20% of the brokers in the cluster. One of this > broker was not able to re-register with Zookeeper and was excluded from the > cluster until it was restarted. Broker logs show the failed registration due > to a "conflicting" znode write which in this case does not exactly match the > scenario covered by KAFKA-6584. > The sequence of logs on the broker is as follows. > First, a connection is established with the Zookeeper node 3. 
> {code:java} > [2023-03-05 16:01:55,342] INFO Socket connection established, initiating > session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, > L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > An existing Zookeeper session was expired, and upon reconnection, the > Zookeeper state change handler was invoked. The creation of the ephemeral > znode /brokers/ids/18 started on the controller thread. > {code:java} > [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) > (kafka.zk.KafkaZkClient){code} > The client "session" timed out after 6 seconds. Note the session is 0x0 and > the absence of "{_}Session establishment complete{_}" log: the broker appears > to have never received or processed the response from the Zookeeper node. > {code:java} > [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from > server in 6000ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, > L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client > started waiting on a new connection notification. > {code:java} > [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until > connected. (kafka.zookeeper.ZooKeeperClient){code} > A new connection was created with the Zookeeper node 1. Note that a valid > (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. 
> {code:java} > [2023-03-05 16:02:02,037] INFO Socket connection established, initiating > session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, > L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty) > [2023-03-05 16:02:03,054] INFO Session establishment complete on server > zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = > 18000 (org.apache.zookeeper.ClientCnxn){code} > The Kafka ZK client is notified of the connection. > {code:java} > [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. > (kafka.zookeeper.ZooKeeperClient){code} > The broker sends the request to create the znode {{/brokers/ids/18}} which > already exists. The error path implemented for KAFKA-6584 is then followed. > However, in this case, the session owning the ephemeral node > {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last > active Zookeeper session which the broker has recorded. And it is also > different from the current session {{0x1006c6e0b830001}} > ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not > attempted. > {code:java} > [2023-03-05 16:02:04,466] ERROR Error while cre
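The three fault types listed in the comment above (connection delay, response drops, session expiration delay) could be modelled as a small fault-injection plan. The sketch below is illustrative only; the class name, fields, and semantics are assumed and do not mirror the linked kafka-broker-reg tool's actual API.

```java
// Hypothetical fault-injection plan in the spirit of the reproduction setup.
// Dropping the very first server response models the incident: the server has
// created a session, but the client never observes it (sessionid stays 0x0).
final class FaultPlan {
    private final long connectDelayMs;       // delay before the connection is established
    private final int dropFirstResponses;    // server responses to swallow (e.g. session create)
    private final long sessionExpiryDelayMs; // extra delay before the server expires the session
    private int responsesSeen = 0;

    FaultPlan(long connectDelayMs, int dropFirstResponses, long sessionExpiryDelayMs) {
        this.connectDelayMs = connectDelayMs;
        this.dropFirstResponses = dropFirstResponses;
        this.sessionExpiryDelayMs = sessionExpiryDelayMs;
    }

    // Called once per server response; the first dropFirstResponses are lost.
    boolean shouldDropResponse() {
        return responsesSeen++ < dropFirstResponses;
    }

    public static void main(String[] args) {
        FaultPlan plan = new FaultPlan(500L, 1, 10_000L);
        System.out.println(plan.shouldDropResponse()); // true: session-create response dropped
        System.out.println(plan.shouldDropResponse()); // false: later responses pass through
    }
}
```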
[jira] [Comment Edited] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705925#comment-17705925 ] Alexandre Dupriez edited comment on KAFKA-14845 at 3/28/23 10:15 AM: - I could reproduce without forcefully renewing the ZK session. In a nutshell, it is possible (at least with the Netty client for ZK used for reproduction and run in production) to have the Zookeeper server create an active session and then process messages under its authority (including znode creation) even though the client never receives the response for session creation and hence never becomes aware of that session. As a result, an ephemeral znode can be assigned a ZK session ID which the client is completely unaware of. Depending on the session establishment guarantees Zookeeper wants to enforce, this may qualify as a bug in Zookeeper - or remain a feature by design. Regardless, the key on the client side (Kafka) is to preserve the identity of the broker instance across multiple ZK sessions. Currently, that identity is conveyed by the session id, which explains why missing one session is a problem. An alternative could be to have brokers generate a UUID at start-up and include it in the ephemeral znode so that there is no ambiguity on the ephemeral owner. was (Author: adupriez): I could reproduce without forcefully renewing the ZK session. In a nutshell, it is possible (at least with the Netty client for ZK used for reproduction and run in production) to have the Zookeeper server create an active session then process messages under its authority (including znode creation) even though the clients never receives the response for session creation hence is never becomes aware of that session. As a result, an ephemeral znode can be assigned a ZK session ID which the client is completely unaware of. 
Depending on the session establishment guarantees Zookeeper want to enforce, this may qualify as a bug in Zookeeper - or remain a feature by design. Irrespectively, the key on the client side (Kafka) is to preserve the identity of the broker instance across multiple ZK sessions. Currently, that identity is conveyed by the session id, which explains why missing one session is a problem. An alternative could be to have brokers generate a UUID at start-up and include it in ephemeral znode so that there is no ambiguity on the ephemeral owner. > Broker ZNode creation can fail due to a session ID unknown to the broker > > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: broker-registration.drawio.png > > > Our production environment faced a use case where registration of a broker > failed due to the presence of a "conflicting" broker znode in Zookeeper. This > case is not without familiarity to that fixed by KAFKA-6584 and induced by > the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. > A network partition disturbed communication channels between the Kafka and > Zookeeper clusters for about 20% of the brokers in the cluster. One of this > broker was not able to re-register with Zookeeper and was excluded from the > cluster until it was restarted. Broker logs show the failed registration due > to a "conflicting" znode write which in this case does not exactly match the > scenario covered by KAFKA-6584. > The sequence of logs on the broker is as follows. > First, a connection is established with the Zookeeper node 3. 
> {code:java} > [2023-03-05 16:01:55,342] INFO Socket connection established, initiating > session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, > L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > An existing Zookeeper session was expired, and upon reconnection, the > Zookeeper state change handler was invoked. The creation of the ephemeral > znode /brokers/ids/18 started on the controller thread. > {code:java} > [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) > (kafka.zk.KafkaZkClient){code} > The client "session" timed out after 6 seconds. Note the session is 0x0 and > the absence of "{_}Session establishment complete{_}" log: the broker appears > to have never received or processed the response from the Zookeeper node. > {code:java} > [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from > server in 6000ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.Cli
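The UUID-based alternative proposed in the comment above could look like the following. This is only a sketch of the idea, not Kafka's actual registration code; the class, method names, and payload format are all assumed for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Illustrative sketch of the proposed alternative (not Kafka's actual code):
// the broker generates a UUID at start-up, stores it in the ephemeral znode
// payload, and decides ownership from that payload rather than relying on a
// ZK session id the client may never have observed.
final class BrokerInstanceIdentity {
    static final UUID INSTANCE_ID = UUID.randomUUID(); // stable for the process lifetime

    // Hypothetical payload format: "<brokerId>:<instance UUID>".
    static byte[] registrationPayload(int brokerId) {
        return (brokerId + ":" + INSTANCE_ID).getBytes(StandardCharsets.UTF_8);
    }

    // On NodeExists, recreation is safe iff the stored payload carries this
    // instance's UUID, regardless of which session created the znode.
    static boolean ownedByThisInstance(byte[] existingPayload) {
        return new String(existingPayload, StandardCharsets.UTF_8)
                .endsWith(":" + INSTANCE_ID);
    }

    public static void main(String[] args) {
        byte[] payload = registrationPayload(18);
        System.out.println(ownedByThisInstance(payload)); // prints true
    }
}
```

With such a scheme, the conflicting-znode case in this ticket would be resolved by the payload comparison even when the owning session id is unknown to the broker.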
[jira] [Comment Edited] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705925#comment-17705925 ] Alexandre Dupriez edited comment on KAFKA-14845 at 3/28/23 10:15 AM: - I could reproduce without forcefully renewing the ZK session. In a nutshell, it is possible (at least with the Netty client for ZK used for reproduction and run in production) to have the Zookeeper server create an active session and then process messages under its authority (including znode creation) even though the client never receives the response for session creation and hence never becomes aware of that session. As a result, an ephemeral znode can be assigned a ZK session ID which the client is completely unaware of. Depending on the session establishment guarantees Zookeeper wants to enforce, this may qualify as a bug in Zookeeper - or remain a feature by design. Regardless, the key on the client side (Kafka) is to preserve the identity of the broker instance across multiple ZK sessions. Currently, that identity is conveyed by the session id, which explains why missing one session is a problem. An alternative could be to have brokers generate a UUID at start-up and include it in the ephemeral znode so that there is no ambiguity on the ephemeral owner. was (Author: adupriez): I could reproduce without forcefully renewing the ZK session. In a nutshell, it is possible (at least with the Netty client for ZK used for reproduction and run in production) to have the Zookeeper server create an active session then process messages under its authority (including znode creation) even though the clients never receives the response for session creation hence is never becomes aware of that session. As a result, an ephemeral znode can be assigned a ZK session ID which the client is completely unaware of. 
Depending on the session establishment guarantees Zookeeper want to enforce, this may qualify as a bug in Zookeeper - or remain a feature by design. Irrespectively, the key on the client side (Kafka) is to preserve the identity of the broker instance across multiple ZK sessions. Currently, that identity is conveyed by the session id, which explains why missing one session is a problem. An alternative could be to have brokers generate a UUID at start-up and include it in ephemeral znode so that there is no ambiguity on the ephemeral owner. > Broker ZNode creation can fail due to a session ID unknown to the broker > > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: broker-registration.drawio.png > > > Our production environment faced a use case where registration of a broker > failed due to the presence of a "conflicting" broker znode in Zookeeper. This > case is not without familiarity to that fixed by KAFKA-6584 and induced by > the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. > A network partition disturbed communication channels between the Kafka and > Zookeeper clusters for about 20% of the brokers in the cluster. One of this > broker was not able to re-register with Zookeeper and was excluded from the > cluster until it was restarted. Broker logs show the failed registration due > to a "conflicting" znode write which in this case does not exactly match the > scenario covered by KAFKA-6584. > The sequence of logs on the broker is as follows. > First, a connection is established with the Zookeeper node 3. 
> {code:java} > [2023-03-05 16:01:55,342] INFO Socket connection established, initiating > session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, > L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > An existing Zookeeper session was expired, and upon reconnection, the > Zookeeper state change handler was invoked. The creation of the ephemeral > znode /brokers/ids/18 started on the controller thread. > {code:java} > [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) > (kafka.zk.KafkaZkClient){code} > The client "session" timed out after 6 seconds. Note the session is 0x0 and > the absence of "{_}Session establishment complete{_}" log: the broker appears > to have never received or processed the response from the Zookeeper node. > {code:java} > [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from > server in 6000ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.Cl
[jira] [Comment Edited] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705925#comment-17705925 ] Alexandre Dupriez edited comment on KAFKA-14845 at 3/28/23 10:14 AM: - I could reproduce without forcefully renewing the ZK session. In a nutshell, it is possible (at least with the Netty client for ZK used for reproduction and run in production) to have the Zookeeper server create an active session and then process messages under its authority (including znode creation) even though the client never receives the response for session creation and hence never becomes aware of that session. As a result, an ephemeral znode can be assigned a ZK session ID which the client is completely unaware of. Depending on the session establishment guarantees Zookeeper wants to enforce, this may qualify as a bug in Zookeeper - or remain a feature by design. Regardless, the key on the client side (Kafka) is to preserve the identity of the broker instance across multiple ZK sessions. Currently, that identity is conveyed by the session id, which explains why missing one session is a problem. An alternative could be to have brokers generate a UUID at start-up and include it in the ephemeral znode so that there is no ambiguity on the ephemeral owner. was (Author: adupriez): I could reproduce without forcefully renewing the session. In a nutshell, it is possible (at least with the Netty client for ZK used for reproduction and run in production) to have the Zookeeper server create an active session then process messages (including znode creation) even though the clients never receives the response for the session creation hence is not aware of that session. As a result, an ephemeral znode can be assigned a session which the client is completely unaware of. This is exactly what happened to the NR cluster. 
Depending on the session establishment guarantees Zookeeper want to enforce, this may qualify as a bug in Zookeeper - or remain a feature by design. Irrespectively, the key on the client side (Kafka) is to preserve the identity of the broker instance across multiple ZK sessions. Currently, that identity is conveyed by the session id, which explains why missing one session is a problem. An alternative could be to have brokers generate a UUID at start-up and include it in ephemeral znode so that there is no ambiguity on the ephemeral owner. > Broker ZNode creation can fail due to a session ID unknown to the broker > > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: broker-registration.drawio.png > > > Our production environment faced a use case where registration of a broker > failed due to the presence of a "conflicting" broker znode in Zookeeper. This > case is not without familiarity to that fixed by KAFKA-6584 and induced by > the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. > A network partition disturbed communication channels between the Kafka and > Zookeeper clusters for about 20% of the brokers in the cluster. One of this > broker was not able to re-register with Zookeeper and was excluded from the > cluster until it was restarted. Broker logs show the failed registration due > to a "conflicting" znode write which in this case does not exactly match the > scenario covered by KAFKA-6584. > The sequence of logs on the broker is as follows. > First, a connection is established with the Zookeeper node 3. 
> {code:java} > [2023-03-05 16:01:55,342] INFO Socket connection established, initiating > session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, > L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > An existing Zookeeper session was expired, and upon reconnection, the > Zookeeper state change handler was invoked. The creation of the ephemeral > znode /brokers/ids/18 started on the controller thread. > {code:java} > [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) > (kafka.zk.KafkaZkClient){code} > The client "session" timed out after 6 seconds. Note the session is 0x0 and > the absence of "{_}Session establishment complete{_}" log: the broker appears > to have never received or processed the response from the Zookeeper node. > {code:java} > [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from > server in 6000ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache
[jira] [Commented] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705925#comment-17705925 ] Alexandre Dupriez commented on KAFKA-14845: --- I could reproduce without forcefully renewing the session. In a nutshell, it is possible (at least with the Netty client for ZK used for reproduction and run in production) to have the Zookeeper server create an active session and then process messages (including znode creation) even though the client never receives the response for the session creation and hence is not aware of that session. As a result, an ephemeral znode can be assigned a session which the client is completely unaware of. This is exactly what happened to the NR cluster. Depending on the session establishment guarantees Zookeeper wants to enforce, this may qualify as a bug in Zookeeper - or remain a feature by design. Regardless, the key on the client side (Kafka) is to preserve the identity of the broker instance across multiple ZK sessions. Currently, that identity is conveyed by the session id, which explains why missing one session is a problem. An alternative could be to have brokers generate a UUID at start-up and include it in the ephemeral znode so that there is no ambiguity on the ephemeral owner. > Broker ZNode creation can fail due to a session ID unknown to the broker > > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: broker-registration.drawio.png > > > Our production environment faced a use case where registration of a broker > failed due to the presence of a "conflicting" broker znode in Zookeeper. This > case is not without familiarity to that fixed by KAFKA-6584 and induced by > the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. 
> A network partition disturbed communication channels between the Kafka and > Zookeeper clusters for about 20% of the brokers in the cluster. One of this > broker was not able to re-register with Zookeeper and was excluded from the > cluster until it was restarted. Broker logs show the failed registration due > to a "conflicting" znode write which in this case does not exactly match the > scenario covered by KAFKA-6584. > The sequence of logs on the broker is as follows. > First, a connection is established with the Zookeeper node 3. > {code:java} > [2023-03-05 16:01:55,342] INFO Socket connection established, initiating > session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, > L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > An existing Zookeeper session was expired, and upon reconnection, the > Zookeeper state change handler was invoked. The creation of the ephemeral > znode /brokers/ids/18 started on the controller thread. > {code:java} > [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) > (kafka.zk.KafkaZkClient){code} > The client "session" timed out after 6 seconds. Note the session is 0x0 and > the absence of "{_}Session establishment complete{_}" log: the broker appears > to have never received or processed the response from the Zookeeper node. > {code:java} > [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from > server in 6000ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, > L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client > started waiting on a new connection notification. 
> {code:java} > [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until > connected. (kafka.zookeeper.ZooKeeperClient){code} > A new connection was created with the Zookeeper node 1. Note that a valid > (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. > {code:java} > [2023-03-05 16:02:02,037] INFO Socket connection established, initiating > session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, > L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty) > [2023-03-05 16:02:03,054] INFO Session establishment complete on server > zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = > 18000 (org.apache.zooke
[jira] [Assigned] (KAFKA-14852) Propagate Topic Ids to the Group Coordinator for Offset Fetch
[ https://issues.apache.org/jira/browse/KAFKA-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez reassigned KAFKA-14852: - Assignee: Alexandre Dupriez > Propagate Topic Ids to the Group Coordinator for Offset Fetch > - > > Key: KAFKA-14852 > URL: https://issues.apache.org/jira/browse/KAFKA-14852 > Project: Kafka > Issue Type: Sub-task >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Major > Fix For: 3.5.0 > > > This task is the sibling of KAFKA-14793 which propagates topic ids in the > group coordinator on the offset commit (write) path. The purpose of this JIRA > is to change the interfaces of the group coordinator and group coordinator > adapter to propagate topic ids in a similar way. > KAFKA-14691 will add the topic ids to the OffsetFetch API itself so that > topic ids are propagated from clients to the coordinator on the offset fetch > path. > Changes to the persisted data model (group metadata and keys) are out of > scope. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14852) Propagate Topic Ids to the Group Coordinator for Offset Fetch
[ https://issues.apache.org/jira/browse/KAFKA-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14852: -- Description: This task is the sibling of KAFKA-14793 which propagates topic ids in the group coordinator on the offset commit (write) path. The purpose of this JIRA is to change the interfaces of the group coordinator and its adapter to propagate topic ids in a similar way. KAFKA-14691 will add the topic ids to the OffsetFetch API itself so that topic ids are propagated from clients to the coordinator on the offset fetch path. Changes to the persisted data model (group metadata and keys) are out of scope. was: This task is the sibling of KAFKA-14793 which propagates topic ids in the group coordinator on the offset commit (write) path. The purpose of this JIRA is to change the interfaces of the group coordinator and group coordinator adapter to propagate topic ids in a similar way. KAFKA-14691 will add the topic ids to the OffsetFetch API itself so that topic ids are propagated from clients to the coordinator on the offset fetch path. Changes to the persisted data model (group metadata and keys) are out of scope. > Propagate Topic Ids to the Group Coordinator for Offset Fetch > - > > Key: KAFKA-14852 > URL: https://issues.apache.org/jira/browse/KAFKA-14852 > Project: Kafka > Issue Type: Sub-task >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Major > Fix For: 3.5.0 > > > This task is the sibling of KAFKA-14793 which propagates topic ids in the > group coordinator on the offset commit (write) path. The purpose of this JIRA > is to change the interfaces of the group coordinator and its adapter to > propagate topic ids in a similar way. > KAFKA-14691 will add the topic ids to the OffsetFetch API itself so that > topic ids are propagated from clients to the coordinator on the offset fetch > path. > Changes to the persisted data model (group metadata and keys) are out of > scope. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14690) OffsetCommit API Version 9
[ https://issues.apache.org/jira/browse/KAFKA-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14690: -- Fix Version/s: 3.5.0 > OffsetCommit API Version 9 > -- > > Key: KAFKA-14690 > URL: https://issues.apache.org/jira/browse/KAFKA-14690 > Project: Kafka > Issue Type: Sub-task >Reporter: David Jacot >Assignee: Alexandre Dupriez >Priority: Major > Fix For: 3.5.0 > > > The goal of this Jira is to implement version 9 of the OffsetCommit API > as described in KIP-848: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol#KIP848:TheNextGenerationoftheConsumerRebalanceProtocol-OffsetCommitAPI. > Version 9 mainly adds support for topic ids. The consumer and the > admin client must be updated accordingly.
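As a rough illustration of what adding topic-id support at a given API version means, the following minimal sketch shows version-gated population of a request field. The types and method below are hypothetical stand-ins for illustration only, not the actual Kafka protocol classes:

```java
import java.util.UUID;

public class OffsetCommitVersionSketch {
    static final UUID ZERO_ID = new UUID(0L, 0L);

    // Hypothetical topic entry of an OffsetCommit request: older versions
    // identify the topic by name; from version 9 on, by its topic id.
    record CommitTopic(String name, UUID id) {}

    // Populate the entry according to the negotiated API version: the id is
    // only sent at version >= 9, otherwise the name is sent and the id stays zero.
    static CommitTopic forVersion(short version, String name, UUID id) {
        return version >= 9 ? new CommitTopic(null, id) : new CommitTopic(name, ZERO_ID);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println(forVersion((short) 9, "payments", id).id().equals(id)); // prints true
        System.out.println(forVersion((short) 8, "payments", id).name());          // prints payments
    }
}
```

This is why the consumer and admin client must change together with the broker: a client negotiating version 9 has to resolve topic ids before building the request.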
[jira] [Created] (KAFKA-14852) Propagate Topic Ids to the Group Coordinator for Offset Fetch
Alexandre Dupriez created KAFKA-14852: - Summary: Propagate Topic Ids to the Group Coordinator for Offset Fetch Key: KAFKA-14852 URL: https://issues.apache.org/jira/browse/KAFKA-14852 Project: Kafka Issue Type: Sub-task Reporter: Alexandre Dupriez Fix For: 3.5.0 This task is the sibling of KAFKA-14793, which propagates topic ids in the group coordinator on the offset commit (write) path. The purpose of this JIRA is to change the interfaces of the group coordinator and the group coordinator adapter to propagate topic ids in a similar way. KAFKA-14691 will add the topic ids to the OffsetFetch API itself so that topic ids are propagated from clients to the coordinator on the offset fetch path. Changes to the persisted data model (group metadata and keys) are out of scope.
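The shape of the interface change can be sketched as follows. Everything here is a hypothetical stand-in (the record, the metadata map, and `resolve` are illustrations, not the real group coordinator or adapter interfaces): offset-fetch keys carry a topic id alongside the topic name, and the adapter resolves ids from a metadata view before delegating to the coordinator, while the persisted group metadata stays keyed by name:

```java
import java.util.*;

public class OffsetFetchSketch {
    // Hypothetical stand-in pairing a topic id, topic name, and partition.
    record TopicIdPartition(UUID topicId, String topic, int partition) {}

    // Hypothetical metadata view mapping topic names to ids.
    static final Map<String, UUID> TOPIC_IDS = Map.of(
            "payments", UUID.fromString("11111111-1111-1111-1111-111111111111"));

    // The adapter enriches name-only requests with topic ids before calling
    // the coordinator, so the coordinator interfaces can become id-aware
    // without touching the persisted data model (out of scope for this JIRA).
    static List<TopicIdPartition> resolve(String topic, List<Integer> partitions) {
        UUID id = TOPIC_IDS.getOrDefault(topic, new UUID(0L, 0L)); // zero id if unknown
        List<TopicIdPartition> out = new ArrayList<>();
        for (int p : partitions) out.add(new TopicIdPartition(id, topic, p));
        return out;
    }

    public static void main(String[] args) {
        List<TopicIdPartition> keys = resolve("payments", List.of(0, 1));
        System.out.println(keys.size()); // prints 2
    }
}
```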
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a case where the registration of a broker failed due to the presence of a "conflicting" broker znode in ZooKeeper. This case is similar to the one fixed by KAFKA-6584, which was induced by the ZooKeeper bug (or feature) tracked in ZOOKEEPER-2985, still open as of today. A network partition disrupted the communication channels between the Kafka and ZooKeeper clusters for about 20% of the brokers in the cluster. One of these brokers was not able to re-register with ZooKeeper and was excluded from the cluster until it was restarted. Broker logs show that the registration failed due to a "conflicting" znode write which, in this case, does not exactly match the scenario covered by KAFKA-6584. The broker did not restart and was not unhealthy. In the following logs, the broker IP is 1.2.3.4. The sequence of logs on the broker is as follows. First, a connection is established with ZooKeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.2.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing ZooKeeper session had expired, and upon reconnection, the ZooKeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note that the session ID is 0x0 and that the "{_}Session establishment complete{_}" log line is absent: the broker appears to have never received or processed the response from the ZooKeeper node. 
{code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting for a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was then established with ZooKeeper node 1. Note that a valid (new) session ({{0x1006c6e0b830001}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker then sends the request to create the znode {{/brokers/ids/18}}, which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node, {{0x300000043230ac1}} ({{216172783240153793}}), differs from the last active ZooKeeper session recorded by the broker, and it is also different from the current session {{0x1006c6e0b830001}} ({{72176813933264897}}); hence the recreation of the broker znode is not attempted. 
{code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(Kafka{code}
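The decision path that fails here can be sketched as follows. This is a simplified, hypothetical reconstruction for illustration, not the actual KafkaZkClient code: on NodeExists, the ephemeral node is recreated only when its owner is the broker's own session; an owner matching neither the current nor the previous session ends in an error, which is the situation in the logs above.

```java
public class EphemeralOwnerCheck {
    enum Outcome { ALREADY_OWNED, RECREATE, FAIL_NODE_EXISTS }

    // Simplified version of the KAFKA-6584 error path: decide what to do when
    // creating /brokers/ids/<id> hits NodeExists, based on session ownership.
    static Outcome onNodeExists(long owner, long currentSession, long previousSession) {
        if (owner == currentSession) return Outcome.ALREADY_OWNED;  // the node is already ours
        if (owner == previousSession) return Outcome.RECREATE;      // stale node from our expired session
        return Outcome.FAIL_NODE_EXISTS;                            // owner unknown to the broker: give up
    }

    public static void main(String[] args) {
        // Session ids from the incident: the node owner matches neither session
        // the broker knows about, so registration fails.
        long owner = 216172783240153793L;
        long current = 72176813933264897L;
        long previous = 0L; // the broker never recorded the session that owns the node
        System.out.println(onNodeExists(owner, current, previous)); // prints FAIL_NODE_EXISTS
    }
}
```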
[jira] [Assigned] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez reassigned KAFKA-14845: - Assignee: Alexandre Dupriez > Broker ZNode creation can fail due to a session ID unknown to the broker > > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: broker-registration.drawio.png
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case is not covered by KAFKA-6584. The broker did not restart and was not unhealthy. In the following logs, the broker IP is 1.2.3.4. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. 
{code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. 
{code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853)
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case is not covered by KAFKA-6584. The broker did not restart and was not unhealthy. In the following logs, the broker IP is 1.2.3.4. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. 
{code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting for a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}}, which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker had recorded. It is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), so the recreation of the broker znode is not attempted. 
{code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853)
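The ownership check at the heart of this failure can be sketched as follows. This is a hypothetical, simplified model of the behaviour described above, not the actual {{kafka.zk.KafkaZkClient$CheckedEphemeral}} code: recreation of the conflicting znode is only attempted when the existing node is owned by a session the broker recognises as its own, so an owner session unknown to the broker aborts registration.

```java
// Hypothetical sketch of the conflict check described above; the real logic
// lives in kafka.zk.KafkaZkClient (CheckedEphemeral.getAfterNodeExists).
final class EphemeralOwnerCheck {

    // Recreation of /brokers/ids/<id> is only attempted when the existing
    // node is owned by the broker's current session; any other owner,
    // including one unknown to the broker as in this incident, fails fast
    // with a NodeExists error.
    static boolean mayRecreate(long ownerSessionId, long currentSessionId) {
        return ownerSessionId == currentSessionId;
    }

    public static void main(String[] args) {
        long owner = 216172783240153793L;  // session owning the znode
        long current = 72176813933264897L; // broker's current session
        System.out.println(mayRecreate(owner, current)); // prints false
    }
}
```

With the session IDs from the logs above, the check fails, so the broker stays unregistered until the stale session's ephemeral node eventually expires or the broker is restarted.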
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Attachment: broker-registration.drawio.png
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Attachment: (was: broker-registration.drawio)
[jira] [Created] (KAFKA-14845) Broker ZNode creation can fail due to a session ID unknown to the broker
Alexandre Dupriez created KAFKA-14845: - Summary: Broker ZNode creation can fail due to a session ID unknown to the broker Key: KAFKA-14845 URL: https://issues.apache.org/jira/browse/KAFKA-14845 Project: Kafka Issue Type: Bug Reporter: Alexandre Dupriez Attachments: broker-registration.drawio
[jira] [Updated] (KAFKA-14806) Add connection timeout in PlaintextSender used by SelectorTests
[ https://issues.apache.org/jira/browse/KAFKA-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14806: -- Description: Tests in {{SelectorTest}} can fail due to spurious connection timeouts. One example can be found in [this build|https://github.com/apache/kafka/pull/13378/checks?check_run_id=11970595528], where the client connection that the {{PlaintextSender}} tried to open could not be established before the test timed out. It may be worth enforcing a connection timeout and retries if this adds to the selector tests' resiliency. Note that {{PlaintextSender}} is only used by {{SelectorTest}}, so the scope of the change would remain local. -- This message was sent by Atlassian Jira (v8.20.10#820010)
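The proposed hardening can be sketched as a bounded connect-with-timeout loop. The class and parameter names below are illustrative assumptions, not the actual {{PlaintextSender}} API; the point is that each attempt fails fast instead of hanging until the test times out.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch of the proposed change: bound each connection attempt
// with a timeout and retry a few times before failing the test sender.
final class RetryingConnector {
    static Socket connect(InetSocketAddress address, int timeoutMs, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Socket socket = new Socket();
            try {
                // Fail this attempt after timeoutMs instead of blocking indefinitely.
                socket.connect(address, timeoutMs);
                return socket;
            } catch (IOException e) {
                last = e;
                try { socket.close(); } catch (IOException ignored) { }
            }
        }
        throw last; // surface the last failure once all attempts are exhausted
    }
}
```

A spurious slow handshake then costs at most {{timeoutMs * maxAttempts}} before the test fails with a clear cause, rather than an opaque test-level timeout.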
[jira] [Assigned] (KAFKA-14806) Add connection timeout in PlaintextSender used by SelectorTests
[ https://issues.apache.org/jira/browse/KAFKA-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez reassigned KAFKA-14806: - Assignee: Alexandre Dupriez
[jira] [Created] (KAFKA-14806) Add connection timeout in PlaintextSender used by SelectorTests
Alexandre Dupriez created KAFKA-14806: - Summary: Add connection timeout in PlaintextSender used by SelectorTests Key: KAFKA-14806 URL: https://issues.apache.org/jira/browse/KAFKA-14806 Project: Kafka Issue Type: Test Reporter: Alexandre Dupriez
[jira] [Updated] (KAFKA-14793) Propagate Topic Ids to the Group Coordinator during Offsets Commit
[ https://issues.apache.org/jira/browse/KAFKA-14793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14793: -- Priority: Major (was: Minor) > Propagate Topic Ids to the Group Coordinator during Offsets Commit > -- > > Key: KAFKA-14793 > URL: https://issues.apache.org/jira/browse/KAFKA-14793 > Project: Kafka > Issue Type: Sub-task >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Major > > KAFKA-14690 introduces topic ids in the OffsetCommit API in the request layer. Propagation of topic ids within the group coordinator has been left out of scope: it remains open whether topic ids are re-mapped internally in the group coordinator or whether the group coordinator starts to rely on {{{}TopicIdPartition{}}}. > Note that with KAFKA-14690, the offset commit response data built by the coordinator includes topic names only, and topic ids need to be injected afterwards, outside of the coordinator, before serializing the response.
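The post-coordinator injection step mentioned above can be sketched as a name-to-id lookup applied to the response topics before serialization. All names here are hypothetical illustrations, not the actual KafkaApis or group coordinator code paths, and string ids stand in for Kafka's topic id type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the coordinator's response carries topic names only,
// so topic ids are injected afterwards from the broker's name-to-id mapping.
final class TopicIdInjector {

    // Placeholder for an unknown topic id, standing in for a zero/sentinel id.
    static final String UNKNOWN_ID = "00000000-0000-0000-0000-000000000000";

    /** Map each topic name in the response to its id; unknown names get a sentinel id. */
    static Map<String, String> inject(Iterable<String> responseTopics, Map<String, String> topicIds) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String name : responseTopics) {
            out.put(name, topicIds.getOrDefault(name, UNKNOWN_ID));
        }
        return out;
    }
}
```

Moving this lookup inside the coordinator (or switching it to {{TopicIdPartition}}) would remove the extra pass over the response, which is the design question the ticket leaves open.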
[jira] [Updated] (KAFKA-14793) Propagate Topic Ids to the Group Coordinator during Offsets Commit
[ https://issues.apache.org/jira/browse/KAFKA-14793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14793: -- Summary: Propagate Topic Ids to the Group Coordinator during Offsets Commit (was: Propagate Topic ids to the Group Coordinator during Offsets Commit)
[jira] [Assigned] (KAFKA-14793) Propagate Topic ids to the Group Coordinator during Offsets Commit
[ https://issues.apache.org/jira/browse/KAFKA-14793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez reassigned KAFKA-14793: - Assignee: Alexandre Dupriez
[jira] [Updated] (KAFKA-14793) Propagate Topic ids to the Group Coordinator during Offsets Commit
[ https://issues.apache.org/jira/browse/KAFKA-14793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14793: -- Summary: Propagate Topic ids to the Group Coordinator during Offsets Commit (was: Propagate topic ids to the group coordinator) > Propagate Topic ids to the Group Coordinator during Offsets Commit > -- > > Key: KAFKA-14793 > URL: https://issues.apache.org/jira/browse/KAFKA-14793 > Project: Kafka > Issue Type: Sub-task >Reporter: Alexandre Dupriez >Priority: Minor > > KAFKA-14690 introduces topic ids in the OffsetCommit API in the request > layer. Propagation of topic ids within the group coordinator has been left > out of scope. Whether topic ids are re-mapped internally in the group > coordinator or the group coordinator starts to rely on > {{TopicIdPartition}} remains to be decided. > Note that with KAFKA-14690, the offset commit response data built by the > coordinator includes topic names only, and topic ids need to be injected > afterwards outside of the coordinator before serializing the response.
[jira] [Created] (KAFKA-14793) Propagate topic ids to the group coordinator
Alexandre Dupriez created KAFKA-14793: - Summary: Propagate topic ids to the group coordinator Key: KAFKA-14793 URL: https://issues.apache.org/jira/browse/KAFKA-14793 Project: Kafka Issue Type: Sub-task Reporter: Alexandre Dupriez KAFKA-14690 introduces topic ids in the OffsetCommit API in the request layer. Propagation of topic ids within the group coordinator has been left out of scope. Whether topic ids are re-mapped internally in the group coordinator or the group coordinator starts to rely on {{TopicIdPartition}} remains to be decided. Note that with KAFKA-14690, the offset commit response data built by the coordinator includes topic names only, and topic ids need to be injected afterwards outside of the coordinator before serializing the response.
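The distinction the ticket draws — offsets keyed by topic name versus by topic id — can be illustrated with a minimal, self-contained Java sketch. The nested {{TopicIdPartition}} record below is a hypothetical stand-in for Kafka's class of the same name, not its actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Sketch of id-keyed offset storage. The nested record is an illustrative
// stand-in for Kafka's TopicIdPartition, not the real class.
public class TopicIdOffsets {
    public record TopicIdPartition(UUID topicId, int partition) {}

    private static final Map<TopicIdPartition, Long> committed = new HashMap<>();

    // Store a committed offset under the (topicId, partition) key.
    public static void commit(UUID topicId, int partition, long offset) {
        committed.put(new TopicIdPartition(topicId, partition), offset);
    }

    // Returns null when no offset is known for that id, e.g. after a topic
    // is deleted and re-created under the same name with a fresh id.
    public static Long fetch(UUID topicId, int partition) {
        return committed.get(new TopicIdPartition(topicId, partition));
    }

    public static void main(String[] args) {
        UUID fooId = UUID.randomUUID();
        commit(fooId, 0, 42L);
        System.out.println(fetch(fooId, 0));             // 42
        System.out.println(fetch(UUID.randomUUID(), 0)); // null: different id, even for the same name
    }
}
```

Keying by id rather than name is what prevents a commit against a deleted-and-recreated topic from silently matching the stale entry, which is the motivation for propagating ids into the coordinator.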
[jira] [Assigned] (KAFKA-14780) Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay deterministic
[ https://issues.apache.org/jira/browse/KAFKA-14780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez reassigned KAFKA-14780: - Assignee: Alexandre Dupriez > Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay > deterministic > > > Key: KAFKA-14780 > URL: https://issues.apache.org/jira/browse/KAFKA-14780 > Project: Kafka > Issue Type: Improvement >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > > The test {{RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay}} > relies on the actual system clock, which makes it frequently fail on my poor > IntelliJ setup. > > The {{RefreshingHttpsJwks}} component creates and uses a scheduled > executor service. We could expose the scheduling mechanism to be able to mock > its behaviour. One way to do this could be to use the {{KafkaScheduler}}, which has > a {{MockScheduler}} implementation relying on {{MockTime}} instead of > the real time clock.
[jira] [Updated] (KAFKA-14780) Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay deterministic
[ https://issues.apache.org/jira/browse/KAFKA-14780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14780: -- Description: The test {{RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay}} relies on the actual system clock, which makes it frequently fail on my poor IntelliJ setup. The {{RefreshingHttpsJwks}} component creates and uses a scheduled executor service. We could expose the scheduling mechanism to be able to mock its behaviour. One way to do this could be to use the {{KafkaScheduler}}, which has a {{MockScheduler}} implementation relying on {{MockTime}} instead of the real time clock. was: The test {{RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay}} relies on the actual system clock, which makes it frequently fail on my poor IntelliJ setup. The {{RefreshingHttpsJwks}} component creates and uses a scheduled executor service. We could expose the scheduling mechanism to be able to mock its behaviour. One way to do this could be to use the `KafkaScheduler`, which has a {{MockScheduler}} implementation relying on {{MockTime}} instead of the real time clock. > Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay > deterministic > > > Key: KAFKA-14780 > URL: https://issues.apache.org/jira/browse/KAFKA-14780 > Project: Kafka > Issue Type: Test >Reporter: Alexandre Dupriez >Priority: Minor > > The test {{RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay}} > relies on the actual system clock, which makes it frequently fail on my poor > IntelliJ setup. > > The {{RefreshingHttpsJwks}} component creates and uses a scheduled > executor service. We could expose the scheduling mechanism to be able to mock > its behaviour. One way to do this could be to use the {{KafkaScheduler}}, which has > a {{MockScheduler}} implementation relying on {{MockTime}} instead of > the real time clock.
[jira] [Updated] (KAFKA-14780) Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay deterministic
[ https://issues.apache.org/jira/browse/KAFKA-14780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14780: -- Issue Type: Improvement (was: Test) > Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay > deterministic > > > Key: KAFKA-14780 > URL: https://issues.apache.org/jira/browse/KAFKA-14780 > Project: Kafka > Issue Type: Improvement >Reporter: Alexandre Dupriez >Priority: Minor > > The test {{RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay}} > relies on the actual system clock, which makes it frequently fail on my poor > IntelliJ setup. > > The {{RefreshingHttpsJwks}} component creates and uses a scheduled > executor service. We could expose the scheduling mechanism to be able to mock > its behaviour. One way to do this could be to use the {{KafkaScheduler}}, which has > a {{MockScheduler}} implementation relying on {{MockTime}} instead of > the real time clock.
[jira] [Updated] (KAFKA-14780) Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay deterministic
[ https://issues.apache.org/jira/browse/KAFKA-14780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14780: -- Description: The test {{RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay}} relies on the actual system clock, which makes it frequently fail on my poor IntelliJ setup. The {{RefreshingHttpsJwks}} component creates and uses a scheduled executor service. We could expose the scheduling mechanism to be able to mock its behaviour. One way to do this could be to use the `KafkaScheduler`, which has a {{MockScheduler}} implementation relying on {{MockTime}} instead of the real time clock. was: The test `RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay` relies on the actual system clock, which makes it frequently fail on my poor IntelliJ setup. The `RefreshingHttpsJwks` component creates and uses a scheduled executor service. We could expose the scheduling mechanism to be able to mock its behaviour. One way to do this could be to use the `KafkaScheduler`, which has a `MockScheduler` implementation relying on `MockTime` instead of the real time clock. > Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay > deterministic > > > Key: KAFKA-14780 > URL: https://issues.apache.org/jira/browse/KAFKA-14780 > Project: Kafka > Issue Type: Test >Reporter: Alexandre Dupriez >Priority: Minor > > The test {{RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay}} > relies on the actual system clock, which makes it frequently fail on my poor > IntelliJ setup. > > The {{RefreshingHttpsJwks}} component creates and uses a scheduled > executor service. We could expose the scheduling mechanism to be able to mock > its behaviour. One way to do this could be to use the `KafkaScheduler`, which has a > {{MockScheduler}} implementation relying on {{MockTime}} instead of the > real time clock.
[jira] [Commented] (KAFKA-14780) Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay deterministic
[ https://issues.apache.org/jira/browse/KAFKA-14780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696882#comment-17696882 ] Alexandre Dupriez commented on KAFKA-14780: --- [~kirktrue], would you agree with the approach? This requires a minor change in `RefreshingHttpsJwks` to expose the scheduler that it uses. Thanks! > Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay > deterministic > > > Key: KAFKA-14780 > URL: https://issues.apache.org/jira/browse/KAFKA-14780 > Project: Kafka > Issue Type: Test >Reporter: Alexandre Dupriez >Priority: Minor > > The test `RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay` > relies on the actual system clock, which makes it frequently fail on my poor > IntelliJ setup. > > The `RefreshingHttpsJwks` component creates and uses a scheduled executor > service. We could expose the scheduling mechanism to be able to mock its > behaviour. One way to do this could be to use the `KafkaScheduler`, which has a > `MockScheduler` implementation relying on `MockTime` instead of the real > time clock.
[jira] [Created] (KAFKA-14780) Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay deterministic
Alexandre Dupriez created KAFKA-14780: - Summary: Make RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay deterministic Key: KAFKA-14780 URL: https://issues.apache.org/jira/browse/KAFKA-14780 Project: Kafka Issue Type: Test Reporter: Alexandre Dupriez The test `RefreshingHttpsJwksTest#testSecondaryRefreshAfterElapsedDelay` relies on the actual system clock, which makes it frequently fail on my poor IntelliJ setup. The `RefreshingHttpsJwks` component creates and uses a scheduled executor service. We could expose the scheduling mechanism to be able to mock its behaviour. One way to do this could be to use the `KafkaScheduler`, which has a `MockScheduler` implementation relying on `MockTime` instead of the real time clock.
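The refactoring suggested in the ticket — driving the scheduler from a mockable clock rather than the system clock — can be sketched in self-contained Java. The `MockTime` and `MockScheduler` classes below are simplified, hypothetical stand-ins for Kafka's test utilities of the same names, not their actual APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Deterministic scheduling sketch: tasks fire only when the mock clock is
// advanced explicitly, so tests never wait on (or race with) real time.
public class MockSchedulerSketch {
    public static class MockTime {
        private long nowMs = 0;
        public long milliseconds() { return nowMs; }
        public void sleep(long ms) { nowMs += ms; } // advances instantly, no real waiting
    }

    public static class MockScheduler {
        private record Task(long dueMs, Runnable body) {}
        private final MockTime time;
        private final List<Task> tasks = new ArrayList<>();

        public MockScheduler(MockTime time) { this.time = time; }

        public void schedule(long delayMs, Runnable body) {
            tasks.add(new Task(time.milliseconds() + delayMs, body));
        }

        // Run and remove every task whose due time has been reached on the mock clock.
        public void tick() {
            tasks.removeIf(t -> {
                if (t.dueMs() <= time.milliseconds()) {
                    t.body().run();
                    return true;
                }
                return false;
            });
        }
    }

    public static void main(String[] args) {
        MockTime time = new MockTime();
        MockScheduler scheduler = new MockScheduler(time);
        List<String> events = new ArrayList<>();
        scheduler.schedule(100, () -> events.add("secondary-refresh"));

        scheduler.tick(); // not due yet: nothing fires
        time.sleep(100);  // advance the mock clock deterministically
        scheduler.tick(); // now the refresh fires, regardless of wall-clock speed
        System.out.println(events); // [secondary-refresh]
    }
}
```

Because the only clock the scheduler ever consults is the injected one, the "secondary refresh after elapsed delay" scenario becomes a plain sequence of `sleep`/`tick` calls instead of a timing-sensitive wait.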
[jira] [Assigned] (KAFKA-14779) Add ACL Authorizer integration test for authorized OffsetCommits with an unknown topic
[ https://issues.apache.org/jira/browse/KAFKA-14779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez reassigned KAFKA-14779: - Assignee: Alexandre Dupriez > Add ACL Authorizer integration test for authorized OffsetCommits with an > unknown topic > -- > > Key: KAFKA-14779 > URL: https://issues.apache.org/jira/browse/KAFKA-14779 > Project: Kafka > Issue Type: Sub-task >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > > Discovered as part of > [PR-13240|https://github.com/apache/kafka/pull/13240], it seems the use > case where a group and topic have the necessary ACLs to allow offsets for > that topic and consumer group to be committed, but the topic is unknown to > the broker (either by name or id), is not covered. The purpose of this > ticket is to add this coverage.
[jira] [Created] (KAFKA-14779) Add ACL Authorizer integration test for authorized OffsetCommits with an unknown topic
Alexandre Dupriez created KAFKA-14779: - Summary: Add ACL Authorizer integration test for authorized OffsetCommits with an unknown topic Key: KAFKA-14779 URL: https://issues.apache.org/jira/browse/KAFKA-14779 Project: Kafka Issue Type: Sub-task Reporter: Alexandre Dupriez Discovered as part of [PR-13240|https://github.com/apache/kafka/pull/13240], it seems the use case where a group and topic have the necessary ACLs to allow offsets for that topic and consumer group to be committed, but the topic is unknown to the broker (either by name or id), is not covered. The purpose of this ticket is to add this coverage.
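The uncovered scenario — a principal authorized on both the group and the topic, committing offsets for a topic the broker does not know — can be sketched as follows. The handler, error names, and inputs are hypothetical simplifications for illustration, not Kafka's actual OffsetCommit path:

```java
import java.util.Map;
import java.util.Set;

// Sketch of the two checks an offset commit passes through in this scenario:
// topic authorization first, then a metadata lookup for the topic.
public class UnknownTopicCommitSketch {
    public enum Errors { NONE, TOPIC_AUTHORIZATION_FAILED, UNKNOWN_TOPIC_OR_PARTITION }

    // readableTopics: topics the principal holds the necessary ACLs on (assumed input).
    // topicPartitions: topics the broker knows, with their partition counts (assumed input).
    public static Errors handleOffsetCommit(Set<String> readableTopics,
                                            Map<String, Integer> topicPartitions,
                                            String topic) {
        if (!readableTopics.contains(topic))
            return Errors.TOPIC_AUTHORIZATION_FAILED;
        if (!topicPartitions.containsKey(topic))
            return Errors.UNKNOWN_TOPIC_OR_PARTITION; // authorized, but unknown: the case the ticket targets
        return Errors.NONE;
    }

    public static void main(String[] args) {
        Set<String> acls = Set.of("known-topic", "missing-topic");
        Map<String, Integer> metadata = Map.of("known-topic", 3);

        System.out.println(handleOffsetCommit(acls, metadata, "known-topic"));   // NONE
        System.out.println(handleOffsetCommit(acls, metadata, "missing-topic")); // UNKNOWN_TOPIC_OR_PARTITION
    }
}
```

An integration test for the ticket would assert the second branch: with ACLs granted but the topic absent from metadata, the commit should fail with an unknown-topic error rather than an authorization error.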