[jira] [Resolved] (KAFKA-17076) logEndOffset could be lost due to log cleaning

2024-09-24 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-17076.
-
Fix Version/s: 4.0.0
 Assignee: Vincent Jiang
   Resolution: Fixed

Merged the PR to trunk.

> logEndOffset could be lost due to log cleaning
> --
>
> Key: KAFKA-17076
> URL: https://issues.apache.org/jira/browse/KAFKA-17076
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Reporter: Jun Rao
>Assignee: Vincent Jiang
>Priority: Major
> Fix For: 4.0.0
>
>
> It's possible for the log cleaner to remove all records in the suffix of the 
> log. If the partition is then reassigned, the new replica won't be able to 
> see the true logEndOffset since there is no record batch associated with it. 
> If this replica becomes the leader, it will assign an already used offset to 
> a newly produced record, which is incorrect.
>  
> It's relatively rare to trigger this issue since the active segment is never 
> cleaned and typically is not empty. However, the following is one possibility.
>  # Records with offsets 100-110 are produced and fully replicated to all 
> replicas in the ISR. All of those records are delete (tombstone) records for 
> certain keys.
>  # A record with offset 111 is produced. It forces the roll of a new segment 
> in broker b1 and is added to the log. The record is not committed and is 
> later truncated from the log, leaving an empty active segment in this log. b1 
> at some point becomes the leader.
>  # The log cleaner kicks in and removes records 100-110.
>  # The partition is reassigned to another broker b2. b2 replicates all 
> records from b1 up to offset 100 and marks its logEndOffset at 100. Since 
> there is no record to replicate after offset 100 in b1, b2's logEndOffset 
> stays at 100 and b2 can join the ISR.
>  # b2 becomes the leader and assigns offset 100 to a new record.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-17400) Introduce a purgatory to deal with share fetch requests that cannot be completed instantaneously

2024-09-11 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-17400.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

merged the PR to trunk.

> Introduce a purgatory to deal with share fetch requests that cannot be 
> completed instantaneously
> 
>
> Key: KAFKA-17400
> URL: https://issues.apache.org/jira/browse/KAFKA-17400
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Abhinav Dixit
>Assignee: Abhinav Dixit
>Priority: Major
> Fix For: 4.0.0
>
>
> When the record lock partition limit is reached, the ShareFetch should wait 
> for up to MaxWaitMs for records to be released.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16746) Add support for share acknowledgement request in KafkaApis

2024-08-12 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-16746.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

merged the PR to trunk

> Add support for share acknowledgement request in KafkaApis
> --
>
> Key: KAFKA-16746
> URL: https://issues.apache.org/jira/browse/KAFKA-16746
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Apoorv Mittal
>Assignee: Chirag Wadhwa
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16745) Add support for share fetch request in KafkaApis

2024-08-06 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-16745.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Merged the PR to trunk.

> Add support for share fetch request in KafkaApis
> 
>
> Key: KAFKA-16745
> URL: https://issues.apache.org/jira/browse/KAFKA-16745
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Apoorv Mittal
>Assignee: Chirag Wadhwa
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-17230) Kafka consumer client doesn't report node request-latency metrics

2024-08-01 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-17230.
-
Fix Version/s: 3.9.0
   Resolution: Fixed

merged the PR to trunk

> Kafka consumer client doesn't report node request-latency metrics
> -
>
> Key: KAFKA-17230
> URL: https://issues.apache.org/jira/browse/KAFKA-17230
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer, metrics
>Reporter: Apoorv Mittal
>Assignee: Apoorv Mittal
>Priority: Major
>  Labels: client, consumer
> Fix For: 3.9.0
>
>
> The Kafka consumer client registers node/connection latency metrics in 
> [Selector.java|https://github.com/apache/kafka/blob/9d634629f29cd402a723d7e7bece18c8099065a4/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L1359]
>  but the values for the metrics are never recorded. This appears to have 
> been an issue since inception. The same metrics are also created for the 
> Kafka producer, and there the values are recorded in 
> [Sender.java|https://github.com/apache/kafka/blob/9d634629f29cd402a723d7e7bece18c8099065a4/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L1092].
>  Hence the node request-latency-max and request-latency-avg metrics appear 
> correctly for the Kafka producer but are NaN for the Kafka consumer.
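
For context, a minimal sketch of how a latency metric is typically registered 
and recorded with the client-side org.apache.kafka.common.metrics API; the 
sensor and group names here are illustrative, not the consumer's actual ones:

{code:java}
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Avg;
import org.apache.kafka.common.metrics.stats.Max;

public class NodeLatencyMetricExample {
    public static void main(String[] args) throws Exception {
        try (Metrics metrics = new Metrics()) {
            // Registering the metrics only creates them; they report NaN until record() is called.
            Sensor requestLatency = metrics.sensor("node-1.request-latency");
            requestLatency.add(metrics.metricName("request-latency-avg", "consumer-node-metrics"), new Avg());
            requestLatency.add(metrics.metricName("request-latency-max", "consumer-node-metrics"), new Max());

            // The missing piece on the consumer side: recording the measured latency per response.
            long latencyMs = 42L; // e.g. response receive time minus request send time
            requestLatency.record(latencyMs);
        }
    }
}
{code}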



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-17076) logEndOffset could be lost due to log cleaning

2024-07-03 Thread Jun Rao (Jira)
Jun Rao created KAFKA-17076:
---

 Summary: logEndOffset could be lost due to log cleaning
 Key: KAFKA-17076
 URL: https://issues.apache.org/jira/browse/KAFKA-17076
 Project: Kafka
  Issue Type: Bug
  Components: core
Reporter: Jun Rao


It's possible for the log cleaner to remove all records in the suffix of the 
log. If the partition is then reassigned, the new replica won't be able to see 
the true logEndOffset since there is no record batch associated with it. If 
this replica becomes the leader, it will assign an already used offset to a 
newly produced record, which is incorrect.

 

It's relatively rare to trigger this issue since the active segment is never 
cleaned and typically is not empty. However, the following is one possibility.
 # Records with offsets 100-110 are produced and fully replicated to all 
replicas in the ISR. All of those records are delete (tombstone) records for 
certain keys.
 # A record with offset 111 is produced. It forces the roll of a new segment in 
broker b1 and is added to the log. The record is not committed and is later 
truncated from the log, leaving an empty active segment in this log. b1 at some 
point becomes the leader.
 # The log cleaner kicks in and removes records 100-110.
 # The partition is reassigned to another broker b2. b2 replicates all records 
from b1 up to offset 100 and marks its logEndOffset at 100. Since there is no 
record to replicate after offset 100 in b1, b2's logEndOffset stays at 100 and 
b2 can join the ISR.
 # b2 becomes the leader and assigns offset 100 to a new record.
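
As an illustration of why the empty suffix hides the true log end offset, here 
is a minimal sketch (class and method names are hypothetical, not Kafka's 
actual fetch path): a follower only learns offsets from the batches it fetches, 
so once the batches at offsets 100-110 have been cleaned away, nothing tells it 
that the leader's true logEndOffset is 111.

{code:java}
// Hypothetical sketch: a follower advances its log end offset only from fetched batches.
final class FollowerLogState {
    private long logEndOffset;

    FollowerLogState(long initialLogEndOffset) {
        this.logEndOffset = initialLogEndOffset;
    }

    // Called for each fetched record batch; the follower's LEO becomes lastOffset + 1.
    void onFetchedBatch(long lastOffsetOfBatch) {
        logEndOffset = lastOffsetOfBatch + 1;
    }

    long logEndOffset() {
        return logEndOffset;
    }
}

// In the scenario above, b2 only ever fetches batches up to offset 99, so its
// logEndOffset stays at 100 even though b1's true logEndOffset is 111.
{code}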



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16541) Potential leader epoch checkpoint file corruption on OS crash

2024-06-05 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-16541.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Merged the PR to trunk.

> Potential leader epoch checkpoint file corruption on OS crash
> -
>
> Key: KAFKA-16541
> URL: https://issues.apache.org/jira/browse/KAFKA-16541
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 3.7.0
>Reporter: Haruki Okada
>Assignee: Haruki Okada
>Priority: Minor
> Fix For: 4.0.0
>
>
> Pointed out by [~junrao] on 
> [GitHub|https://github.com/apache/kafka/pull/14242#discussion_r1556161125]
> [A patch for KAFKA-15046|https://github.com/apache/kafka/pull/14242] got rid 
> of the fsync of the leader-epoch checkpoint file in some paths for 
> performance reasons.
> However, since the checkpoint file is now flushed to the device 
> asynchronously by the OS, its content could become corrupted if the OS 
> crashes suddenly (e.g. due to a power failure or kernel panic) in the middle 
> of a flush.
> A corrupted checkpoint file could prevent the Kafka broker from starting up.
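
For reference, a hedged sketch of the usual crash-safe pattern for small 
checkpoint files (write a temp file, fsync it, then atomically rename it into 
place); this is illustrative and not necessarily the exact fix that was merged:

{code:java}
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class CheckpointWriteExample {
    // Write the checkpoint so that a crash leaves either the old or the new file, never a torn one.
    static void writeCheckpoint(Path checkpoint, String content) throws Exception {
        Path tmp = checkpoint.resolveSibling(checkpoint.getFileName() + ".tmp");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8)));
            ch.force(true); // flush data and metadata to the device before the rename
        }
        // Atomic rename: a recovering broker only ever sees a complete checkpoint file.
        Files.move(tmp, checkpoint, StandardCopyOption.ATOMIC_MOVE);
    }
}
{code}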



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16485) Fix broker metrics to follow kebab/hyphen case

2024-04-09 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-16485.
-
Fix Version/s: 3.8.0
   Resolution: Fixed

Merged the PR to trunk.

> Fix broker metrics to follow kebab/hyphen case
> --
>
> Key: KAFKA-16485
> URL: https://issues.apache.org/jira/browse/KAFKA-16485
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Apoorv Mittal
>Assignee: Apoorv Mittal
>Priority: Major
> Fix For: 3.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15950) Serialize broker heartbeat requests

2024-03-25 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15950.
-
Fix Version/s: 3.8.0
   Resolution: Fixed

merged the PR to trunk.

> Serialize broker heartbeat requests
> ---
>
> Key: KAFKA-15950
> URL: https://issues.apache.org/jira/browse/KAFKA-15950
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 3.7.0
>Reporter: Jun Rao
>Assignee: Igor Soarez
>Priority: Major
> Fix For: 3.8.0
>
>
> This is a followup issue from the discussion in 
> [https://github.com/apache/kafka/pull/14836#discussion_r1409739363].
> {{KafkaEventQueue}} does de-duping and only allows one outstanding 
> {{CommunicationEvent}} in the queue. But it seems that duplicated 
> {{HeartbeatRequest}}s could still be generated. {{CommunicationEvent}} 
> calls {{sendBrokerHeartbeat}}, which calls the following.
> {code:java}
> _channelManager.sendRequest(new BrokerHeartbeatRequest.Builder(data), 
> handler){code}
> The problem is that we have another queue in 
> {{NodeToControllerChannelManagerImpl}} that doesn't do the de-duping. Once a 
> {{CommunicationEvent}} is dequeued from {{KafkaEventQueue}}, a 
> {{HeartbeatRequest}} will be queued in 
> {{NodeToControllerChannelManagerImpl}}. At this point, another 
> {{CommunicationEvent}} could be enqueued in {{KafkaEventQueue}}. When 
> it's processed, another {{HeartbeatRequest}} will be queued in 
> {{NodeToControllerChannelManagerImpl}}.
> This probably won't introduce long-lasting duplicated {{HeartbeatRequest}}s 
> in practice, since a {{CommunicationEvent}} typically sits in 
> {{KafkaEventQueue}} for the heartbeat interval. By that time, other pending 
> {{HeartbeatRequest}}s will be processed and de-duped when enqueuing to 
> {{KafkaEventQueue}}. However, duplicated requests could make it hard to 
> reason about tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16186) Implement broker metrics for client telemetry usage

2024-01-30 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-16186.
-
Fix Version/s: 3.8.0
   Resolution: Fixed

merged the PR to trunk

> Implement broker metrics for client telemetry usage
> ---
>
> Key: KAFKA-16186
> URL: https://issues.apache.org/jira/browse/KAFKA-16186
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Apoorv Mittal
>Assignee: Apoorv Mittal
>Priority: Major
> Fix For: 3.8.0
>
>
> KIP-714 lists new broker metrics that record the usage of client telemetry 
> instances and plugins. Implement the broker metrics as defined in KIP-714.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15813) Improve implementation of client instance cache

2024-01-23 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15813.
-
Fix Version/s: 3.8.0
   Resolution: Fixed

Merged the PR to trunk.

> Improve implementation of client instance cache
> ---
>
> Key: KAFKA-15813
> URL: https://issues.apache.org/jira/browse/KAFKA-15813
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Apoorv Mittal
>Assignee: Apoorv Mittal
>Priority: Major
> Fix For: 3.8.0
>
>
> In the current implementation the ClientMetricsManager uses an LRU cache, but 
> we should also support expiring stale clients, i.e. clients which haven't 
> reported metrics for a while.
>  
> The KIP mentions: This client-instance-specific state is maintained in broker 
> memory up to MAX(60*1000, PushIntervalMs * 3) milliseconds and is used to 
> enforce the push interval rate-limiting. There is no persistence of client 
> instance metrics state across broker restarts or between brokers.
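
A minimal sketch of the expiry rule quoted above, assuming one entry per client 
instance keyed by its last push time; class and method names are illustrative, 
not the broker's actual implementation:

{code:java}
// Illustrative only: one cache entry per client-metrics instance, expired when no
// push has been seen for MAX(60 * 1000, pushIntervalMs * 3) milliseconds.
final class ClientMetricsInstanceState {
    private static final long MIN_RETENTION_MS = 60_000L;

    private final long pushIntervalMs;
    private volatile long lastPushTimestampMs;

    ClientMetricsInstanceState(long pushIntervalMs, long nowMs) {
        this.pushIntervalMs = pushIntervalMs;
        this.lastPushTimestampMs = nowMs;
    }

    void recordPush(long nowMs) {
        lastPushTimestampMs = nowMs;
    }

    // An LRU cache can call this on its entries to evict stale clients.
    boolean isExpired(long nowMs) {
        long retentionMs = Math.max(MIN_RETENTION_MS, 3 * pushIntervalMs);
        return nowMs - lastPushTimestampMs > retentionMs;
    }
}
{code}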



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16137) ListClientMetricsResourcesResponse definition is missing field descriptions

2024-01-22 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-16137.
-
Fix Version/s: 3.8.0
   Resolution: Fixed

merged the PR to trunk.

> ListClientMetricsResourcesResponse definition is missing field descriptions
> ---
>
> Key: KAFKA-16137
> URL: https://issues.apache.org/jira/browse/KAFKA-16137
> Project: Kafka
>  Issue Type: Task
>  Components: admin
>Affects Versions: 3.7.0
>Reporter: Andrew Schofield
>Assignee: Andrew Schofield
>Priority: Trivial
> Fix For: 3.8.0
>
>
> This is purely improving the readability of the Kafka protocol documentation 
> by adding missing description information for the fields of the 
> `ListClientMetricsResources` response.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15811) implement capturing client port information from Socket Server

2024-01-19 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15811.
-
Fix Version/s: 3.8.0
   Resolution: Fixed

merged the PR to trunk.

> implement capturing client port information from Socket Server
> --
>
> Key: KAFKA-15811
> URL: https://issues.apache.org/jira/browse/KAFKA-15811
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Apoorv Mittal
>Assignee: Apoorv Mittal
>Priority: Major
> Fix For: 3.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15684) Add support to describe all subscriptions through utility

2023-12-06 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15684.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

merged the PR to trunk.

> Add support to describe all subscriptions through utility
> -
>
> Key: KAFKA-15684
> URL: https://issues.apache.org/jira/browse/KAFKA-15684
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Apoorv Mittal
>Assignee: Apoorv Mittal
>Priority: Major
> Fix For: 3.7.0
>
>
> The open PR that adds client-metrics support to kafka-configs.sh doesn't list 
> all subscriptions. The functionality is missing because there is no support 
> for listing client subscriptions in the config repository and the admin 
> client. This task should provide a workaround that fetches all subscriptions 
> from the config repository by adding a method to KRaftMetadataCache. Later, a 
> KIP might be needed to add support in AdminClient.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15871) Implement kafka-client-metrics.sh tool

2023-12-06 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15871.
-
Resolution: Fixed

merged the PR to trunk

> Implement kafka-client-metrics.sh tool
> --
>
> Key: KAFKA-15871
> URL: https://issues.apache.org/jira/browse/KAFKA-15871
> Project: Kafka
>  Issue Type: Sub-task
>  Components: admin
>Affects Versions: 3.7.0
>Reporter: Andrew Schofield
>Assignee: Andrew Schofield
>Priority: Major
> Fix For: 3.7.0
>
>
> Implement the `kafka-client-metrics.sh` tool which is introduced in KIP-714 
> and enhanced in KIP-1000.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15831) List Client Metrics Configuration Resources

2023-12-05 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15831.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

merged the PR to trunk.

> List Client Metrics Configuration Resources
> ---
>
> Key: KAFKA-15831
> URL: https://issues.apache.org/jira/browse/KAFKA-15831
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin, clients
>Reporter: Andrew Schofield
>Assignee: Andrew Schofield
>Priority: Major
>  Labels: kip
> Fix For: 3.7.0
>
>
> This JIRA tracks the development of KIP-1000 
> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-1000%3A+List+Client+Metrics+Configuration+Resources).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15965) Test failure: org.apache.kafka.common.requests.BrokerRegistrationRequestTest

2023-12-04 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15965.
-
Fix Version/s: 3.7.0
 Assignee: Colin McCabe
   Resolution: Fixed

This is fixed by https://github.com/apache/kafka/pull/14887.

> Test failure: org.apache.kafka.common.requests.BrokerRegistrationRequestTest
> 
>
> Key: KAFKA-15965
> URL: https://issues.apache.org/jira/browse/KAFKA-15965
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apoorv Mittal
>Assignee: Colin McCabe
>Priority: Major
> Fix For: 3.7.0
>
>
> Two tests, for versions 0 and 1, fail consistently.
> Build: 
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14767/15/tests/
>  
> {code:java}
> org.opentest4j.AssertionFailedError: 
> BrokerRegistrationRequestData(brokerId=0, clusterId='test', 
> incarnationId=Xil73H5bSZ2vYTWUVlf07Q, listeners=[], features=[], rack='a', 
> isMigratingZkBroker=false, logDirs=[], previousBrokerEpoch=1) ==> expected: 
> <-1> but was: <1>Stacktraceorg.opentest4j.AssertionFailedError: 
> BrokerRegistrationRequestData(brokerId=0, clusterId='test', 
> incarnationId=Xil73H5bSZ2vYTWUVlf07Q, listeners=[], features=[], rack='a', 
> isMigratingZkBroker=false, logDirs=[], previousBrokerEpoch=1) ==> expected: 
> <-1> but was: <1>  at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>at 
> app//org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)  
> at 
> app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:166)  
> at app//org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:660)
>   at 
> app//org.apache.kafka.common.requests.BrokerRegistrationRequestTest.testBasicBuild(BrokerRegistrationRequestTest.java:57)
> at 
> java.base@17.0.7/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)at 
> java.base@17.0.7/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>   at 
> java.base@17.0.7/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base@17.0.7/java.lang.reflect.Method.invoke(Method.java:568)
> at 
> app//org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:728)
>   at 
> app//org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
>at 
> app//org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
>  at 
> app//org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156)
> at 
> app//org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147)
>   at 
> app//org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:94)
>at 
> app//org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103)
> at 
> app//org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93)
>  at 
> app//org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> at 
> app//org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
>at 
> app//org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15950) CommunicationEvent should be scheduled with EarliestDeadlineFunction

2023-11-29 Thread Jun Rao (Jira)
Jun Rao created KAFKA-15950:
---

 Summary: CommunicationEvent should be scheduled with 
EarliestDeadlineFunction
 Key: KAFKA-15950
 URL: https://issues.apache.org/jira/browse/KAFKA-15950
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 3.7.0
Reporter: Jun Rao


Currently, CommunicationEvent is scheduled with DeadlineFunction, which ignores 
the schedule time of an existing event. This wasn't an issue when 
CommunicationEvent was always periodic. However, with KAFKA-15360, a 
CommunicationEvent could be scheduled immediately for offline dirs. If a 
periodic CommunicationEvent is scheduled after the immediate CommunicationEvent 
in KafkaEventQueue, the former will cancel the latter but leave the schedule 
time at the periodic deadline. This will unnecessarily delay the communication 
of the failed dir to the controller.
 
Using EarliestDeadlineFunction will fix this issue.
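
For illustration, a hedged sketch of the difference between the two scheduling 
functions: an earliest-deadline function keeps the sooner of the existing and 
the new deadline when a duplicate event is scheduled, so an immediate event is 
never pushed back to the periodic deadline. Method and parameter names are 
illustrative, not the exact KafkaEventQueue API.

{code:java}
import java.util.OptionalLong;

public class DeadlineFunctionSketch {
    // DeadlineFunction-style behavior: always take the newly requested deadline,
    // even if an earlier one was already pending.
    static long overrideDeadlineNs(OptionalLong existingDeadlineNs, long newDeadlineNs) {
        return newDeadlineNs;
    }

    // EarliestDeadlineFunction-style behavior: keep whichever deadline is sooner.
    static long earliestDeadlineNs(OptionalLong existingDeadlineNs, long newDeadlineNs) {
        return existingDeadlineNs.isPresent()
                ? Math.min(existingDeadlineNs.getAsLong(), newDeadlineNs)
                : newDeadlineNs;
    }

    public static void main(String[] args) {
        long immediateNs = 0L;              // event scheduled "now" for an offline dir
        long periodicNs = 2_000_000_000L;   // the regular heartbeat-interval deadline
        System.out.println(overrideDeadlineNs(OptionalLong.of(immediateNs), periodicNs)); // 2000000000: delayed
        System.out.println(earliestDeadlineNs(OptionalLong.of(immediateNs), periodicNs)); // 0: stays immediate
    }
}
{code}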



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15046) Produce performance issue under high disk load

2023-11-29 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15046.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

merged the PR to trunk.

> Produce performance issue under high disk load
> --
>
> Key: KAFKA-15046
> URL: https://issues.apache.org/jira/browse/KAFKA-15046
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 3.3.2
>Reporter: Haruki Okada
>Assignee: Haruki Okada
>Priority: Major
>  Labels: performance
> Fix For: 3.7.0
>
> Attachments: image-2023-06-01-12-46-30-058.png, 
> image-2023-06-01-12-52-40-959.png, image-2023-06-01-12-54-04-211.png, 
> image-2023-06-01-12-56-19-108.png, image-2023-08-18-19-23-36-597.png, 
> image-2023-08-18-19-29-56-377.png
>
>
> * Phenomenon:
>  ** !image-2023-06-01-12-46-30-058.png|width=259,height=236!
>  ** Producer response time 99%ile got quite bad when we performed replica 
> reassignment on the cluster
>  *** RequestQueue scope was significant
>  ** Also, request-time throttling happened around the same time. This caused 
> producers to delay sending messages in the meantime.
>  ** The disk I/O latency was higher than usual due to the high load for 
> replica reassignment.
>  *** !image-2023-06-01-12-56-19-108.png|width=255,height=128!
>  * Analysis:
>  ** The request-handler utilization was much higher than usual.
>  *** !image-2023-06-01-12-52-40-959.png|width=278,height=113!
>  ** Also, thread time utilization was much higher than usual on almost all 
> users
>  *** !image-2023-06-01-12-54-04-211.png|width=276,height=110!
>  ** From taking jstack several times, in most of them we found that a 
> request-handler was doing fsync to flush ProducerState while other 
> request-handlers were waiting on Log#lock to append messages.
>  * 
>  ** 
>  *** 
> {code:java}
> "data-plane-kafka-request-handler-14" #166 daemon prio=5 os_prio=0 
> cpu=51264789.27ms elapsed=599242.76s tid=0x7efdaeba7770 nid=0x1e704 
> runnable  [0x7ef9a12e2000]
>java.lang.Thread.State: RUNNABLE
> at sun.nio.ch.FileDispatcherImpl.force0(java.base@11.0.17/Native 
> Method)
> at 
> sun.nio.ch.FileDispatcherImpl.force(java.base@11.0.17/FileDispatcherImpl.java:82)
> at 
> sun.nio.ch.FileChannelImpl.force(java.base@11.0.17/FileChannelImpl.java:461)
> at 
> kafka.log.ProducerStateManager$.kafka$log$ProducerStateManager$$writeSnapshot(ProducerStateManager.scala:451)
> at 
> kafka.log.ProducerStateManager.takeSnapshot(ProducerStateManager.scala:754)
> at kafka.log.UnifiedLog.roll(UnifiedLog.scala:1544)
> - locked <0x00060d75d820> (a java.lang.Object)
> at kafka.log.UnifiedLog.maybeRoll(UnifiedLog.scala:1523)
> - locked <0x00060d75d820> (a java.lang.Object)
> at kafka.log.UnifiedLog.append(UnifiedLog.scala:919)
> - locked <0x00060d75d820> (a java.lang.Object)
> at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:760)
> at 
> kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1170)
> at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1158)
> at 
> kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:956)
> at 
> kafka.server.ReplicaManager$$Lambda$2379/0x000800b7c040.apply(Unknown 
> Source)
> at 
> scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
> at 
> scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
> at scala.collection.mutable.HashMap.map(HashMap.scala:35)
> at 
> kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:944)
> at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:602)
> at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:666)
> at kafka.server.KafkaApis.handle(KafkaApis.scala:175)
> at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
> at java.lang.Thread.run(java.base@11.0.17/Thread.java:829) {code}
>  * 
>  ** Also, there were a bunch of logs showing that writing producer snapshots 
> took hundreds of milliseconds.
>  *** 
> {code:java}
> ...
> [2023-05-01 11:08:36,689] INFO [ProducerStateManager partition=xxx-4] Wrote 
> producer snapshot at offset 1748817854 with 8 producer ids in 809 ms. 
> (kafka.log.ProducerStateManager)
> [2023-05-01 11:08:37,319] INFO [ProducerStateManager partition=yyy-34] Wrote 
> producer snapshot at offset 247996937813 with 0 producer ids in 547 ms. 
> (kafka.log.ProducerStateManager)
> [2023-05-01 11:08:38,887] INFO [ProducerStateManager partition=zzz-9] Wrote 
> producer snapshot at offset 226222355404 with 0 producer ids in 576 ms

[jira] [Resolved] (KAFKA-15651) Investigate auto commit guarantees during Consumer.assign()

2023-10-20 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15651.
-
Resolution: Not A Problem

> Investigate auto commit guarantees during Consumer.assign()
> ---
>
> Key: KAFKA-15651
> URL: https://issues.apache.org/jira/browse/KAFKA-15651
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Reporter: Kirk True
>Assignee: Kirk True
>Priority: Major
>  Labels: consumer-threading-refactor, kip-848-preview
>
> In the {{assign()}} method implementation, both {{KafkaConsumer}} and 
> {{PrototypeAsyncConsumer}} commit offsets asynchronously. Is this 
> intentional? [~junrao] asks in a [recent PR 
> review|https://github.com/apache/kafka/pull/14406/files/193af8230d0c61853d764cbbe29bca2fc6361af9#r1349023459]:
> {quote}Do we guarantee that the new owner of the unsubscribed partitions 
> could pick up the latest committed offset?
> {quote}
> Let's confirm whether the asynchronous approach is acceptable and correct. If 
> it is, great, let's enhance the documentation to briefly explain why. If it 
> is not, let's correct the behavior if it's within the API semantic 
> expectations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15582) Clean shutdown detection, broker side

2023-10-19 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15582.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

merged the PR to trunk

> Clean shutdown detection, broker side
> -
>
> Key: KAFKA-15582
> URL: https://issues.apache.org/jira/browse/KAFKA-15582
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Major
> Fix For: 3.7.0
>
>
> The clean shutdown file can now include the broker epoch before shutdown. 
> During the broker start process, the broker should extract the broker epochs 
> from the clean shutdown files. If successful, send the broker epoch through 
> the broker registration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14960) Metadata Request Manager and listTopics/partitionsFor API

2023-09-21 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-14960.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

merged the PR to trunk.

> Metadata Request Manager and listTopics/partitionsFor API
> -
>
> Key: KAFKA-14960
> URL: https://issues.apache.org/jira/browse/KAFKA-14960
> Project: Kafka
>  Issue Type: Task
>  Components: clients, consumer
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Major
>  Labels: consumer-threading-refactor
> Fix For: 3.7.0
>
>
> Implement listTopics and partitionsFor



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15306) Integrate committed offsets logic when updating fetching positions

2023-09-18 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15306.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

merged the PR to trunk

> Integrate committed offsets logic when updating fetching positions
> --
>
> Key: KAFKA-15306
> URL: https://issues.apache.org/jira/browse/KAFKA-15306
> Project: Kafka
>  Issue Type: Task
>  Components: clients, consumer
>Reporter: Lianet Magrans
>Assignee: Lianet Magrans
>Priority: Major
>  Labels: consumer-threading-refactor
> Fix For: 3.7.0
>
>
> Integrate refreshCommittedOffsets logic, currently performed by the 
> coordinator, into the update fetch positions performed on every iteration of 
> the async consumer poll loop. This should rely on the CommitRequestManager to 
> perform the request based on the refactored model, but it should reuse the 
> logic for processing the committed offsets and updating the positions. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15163) Implement validatePositions functionality for new KafkaConsumer

2023-09-13 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15163.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

This is covered in https://github.com/apache/kafka/pull/14346.

> Implement validatePositions functionality for new KafkaConsumer
> ---
>
> Key: KAFKA-15163
> URL: https://issues.apache.org/jira/browse/KAFKA-15163
> Project: Kafka
>  Issue Type: Task
>  Components: clients, consumer
>Reporter: Lianet Magrans
>Assignee: Lianet Magrans
>Priority: Major
>  Labels: consumer-threading-refactor
> Fix For: 3.7.0
>
>
> Introduce support for validating positions in the new OffsetsRequestManager. 
> This task will include a new event for the validatePositions calls performed 
> from the new consumer, and the logic for handling such events in the 
> OffsetRequestManager.
> The validate positions implementation will keep the same behaviour as the one 
> in the old consumer, but adapted to the new threading model. So it is based 
> on a VALIDATE_POSITIONS event that is submitted to the background thread and 
> then processed by the ApplicationEventProcessor. The processing itself is 
> done by the OffsetRequestManager, given that this will require an 
> OFFSET_FOR_LEADER_EPOCH request. This task will introduce support for such 
> requests in the OffsetRequestManager, responsible for offset-related requests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15115) Implement resetPositions functionality in OffsetsRequestManager

2023-09-13 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15115.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

merged the PR to trunk

> Implement resetPositions functionality in OffsetsRequestManager
> ---
>
> Key: KAFKA-15115
> URL: https://issues.apache.org/jira/browse/KAFKA-15115
> Project: Kafka
>  Issue Type: Task
>  Components: clients, consumer
>Reporter: Lianet Magrans
>Assignee: Lianet Magrans
>Priority: Major
>  Labels: consumer-threading-refactor
> Fix For: 3.7.0
>
>
> Introduce support for resetting positions in the new OffsetsRequestManager. 
> This task will include a new event for the resetPositions calls performed 
> from the new consumer, and the logic for handling such events in the 
> OffsetRequestManager.
> The reset positions implementation will keep the same behaviour as the one in 
> the old consumer, but adapted to the new threading model. So it is based on a 
> RESET_POSITIONS event that is submitted to the background thread and then 
> processed by the ApplicationEventProcessor. The processing itself is done by 
> the OffsetRequestManager, given that this will require a LIST_OFFSETS request 
> for the partitions awaiting reset.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-9800) [KIP-580] Client Exponential Backoff Implementation

2023-09-05 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-9800.

Fix Version/s: 3.7.0
   Resolution: Fixed

Merged [https://github.com/apache/kafka/pull/14111] to trunk.

> [KIP-580] Client Exponential Backoff Implementation
> ---
>
> Key: KAFKA-9800
> URL: https://issues.apache.org/jira/browse/KAFKA-9800
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Cheng Tan
>Assignee: Andrew Schofield
>Priority: Major
>  Labels: KIP-580, client
> Fix For: 3.7.0
>
>
> Design:
> The main idea is to bookkeep the failed attempts. Currently, the retry 
> backoff has two main usage patterns:
>  # Synchronous retries in a blocking loop. The thread will sleep in each 
> iteration for retry backoff ms.
>  # Async retries. In each poll, the retries that do not meet the backoff will 
> be filtered. The data class often maintains a 1:1 mapping to a set of 
> requests which are logically associated. (i.e. a set contains only one 
> initial request and only its retries.)
> For type 1, we can utilize a local failure counter of a Java generic data 
> type.
> For case 2, I already wrapped the exponential backoff/timeout util class in 
> my KIP-601 
> [implementation|https://github.com/apache/kafka/pull/8683/files#diff-9ca2b1294653dfa914b9277de62b52e3R28]
>  which takes the number of attempts and returns the backoff/timeout value at 
> the corresponding level. Thus, we can add a new class property to those 
> classes containing retriable data in order to record the number of failed 
> attempts.
>  
> Changes:
> KafkaProducer:
>  # Produce request (ApiKeys.PRODUCE). Currently, the backoff applies to each 
> ProducerBatch in the Accumulator, which already has an attribute attempts 
> recording the number of failed attempts. So we can let the Accumulator 
> calculate the new retry backoff for each batch when it enqueues them, to 
> avoid instantiating the util class multiple times.
>  # Transaction request (ApiKeys..*TXN). TxnRequestHandler will have a new 
> class property of type `Long` to record the number of attempts.
> KafkaConsumer:
>  # Some synchronous retry use cases. Record the failed attempts in the 
> blocking loop.
>  # Partition request (ApiKeys.OFFSET_FOR_LEADER_EPOCH, ApiKeys.LIST_OFFSETS). 
> Though the actual requests are packed for each node, the current 
> implementation is applying backoff to each topic partition, where the backoff 
> value is kept by TopicPartitionState. Thus, TopicPartitionState will have the 
> new property recording the number of attempts.
> Metadata:
>  #  Metadata lives as a singleton in many clients. Add a new property 
> recording the number of attempts
>  AdminClient:
>  # AdminClient has its own request abstraction Call. The failed attempts are 
> already kept by the abstraction. So probably clean the Call class logic a bit.
> Existing tests:
>  # If the tests are testing the retry backoff, add a delta to the assertion, 
> considering the existence of the jitter.
>  # If the tests are testing other functionality, we can specify the same 
> value for both `retry.backoff.ms` and `retry.backoff.max.ms` in order to make 
> the retry backoff static. We can use this trick to make the existing tests 
> compatible with the changes.
> There are other common usages that look like client.poll(timeout), where the 
> timeout passed in is the retry backoff value. We won't change these usages, 
> since their underlying logic is nioSelector.select(timeout) and 
> nioSelector.selectNow(), which means that if no interested op exists, the 
> client will block for the retry backoff milliseconds. This is an optimization 
> for when there's no request that needs to be sent but the client is waiting 
> for responses. Specifically, if the client fails the in-flight requests 
> before the retry backoff milliseconds have passed, it still needs to wait 
> until that amount of time has passed, unless there's a new request that needs 
> to be sent.
>  
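
A hedged sketch of the kind of attempt-driven computation KIP-580 describes: 
the backoff grows exponentially with the number of failed attempts, is capped 
at retry.backoff.max.ms, and gets random jitter. Class and field names are 
illustrative, not necessarily the utility class referenced above.

{code:java}
import java.util.concurrent.ThreadLocalRandom;

public final class ExponentialBackoffSketch {
    private final long initialMs;  // e.g. retry.backoff.ms
    private final long maxMs;      // e.g. retry.backoff.max.ms
    private final double jitter;   // e.g. 0.2 for +/- 20%

    public ExponentialBackoffSketch(long initialMs, long maxMs, double jitter) {
        this.initialMs = initialMs;
        this.maxMs = maxMs;
        this.jitter = jitter;
    }

    public long backoffMs(int failedAttempts) {
        double exp = Math.min((double) maxMs, initialMs * Math.pow(2.0, failedAttempts));
        double rand = jitter == 0.0
                ? 1.0
                : ThreadLocalRandom.current().nextDouble(1.0 - jitter, 1.0 + jitter);
        return (long) (exp * rand);
    }

    public static void main(String[] args) {
        ExponentialBackoffSketch backoff = new ExponentialBackoffSketch(100, 1_000, 0.2);
        for (int attempts = 0; attempts <= 5; attempts++) {
            System.out.println(attempts + " failed attempts -> " + backoff.backoffMs(attempts) + " ms");
        }
    }
}
{code}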



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15081) Implement new consumer offsetsForTimes

2023-09-01 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-15081.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

merged the PR to trunk

> Implement new consumer offsetsForTimes
> --
>
> Key: KAFKA-15081
> URL: https://issues.apache.org/jira/browse/KAFKA-15081
> Project: Kafka
>  Issue Type: Task
>  Components: clients, consumer
>Reporter: Lianet Magrans
>Assignee: Lianet Magrans
>Priority: Major
>  Labels: consumer-threading-refactor
> Fix For: 3.7.0
>
>
> Implement offsetForTimes for the kafka consumer based on the new threading 
> model, using the OffsetsRequestManager. No changes at the consumer API level.
> A call to the offsetsForTime consumer functionality should generate a 
> ListOffsetsApplicationEvent with the timestamps to search provided in 
> parameters. The ListOffsetsApplicationEvent is then handled by the 
> OffsetsRequestManager, that builds the ListOffsets requests and processes the 
> response when received.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14965) Introduce OffsetsRequestManager to integrate ListOffsets requests into new consumer threading refactor

2023-09-01 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-14965.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

merged the PR to trunk

> Introduce OffsetsRequestManager to integrate ListOffsets requests into new 
> consumer threading refactor
> --
>
> Key: KAFKA-14965
> URL: https://issues.apache.org/jira/browse/KAFKA-14965
> Project: Kafka
>  Issue Type: Task
>  Components: clients, consumer
>Reporter: Lianet Magrans
>Assignee: Lianet Magrans
>Priority: Major
>  Labels: consumer-threading-refactor
> Fix For: 3.7.0
>
>
> This task introduces new functionality for handling ListOffsets requests for 
> the new consumer implementation, as part of the ongoing work on the consumer 
> threading model refactor.
> This task introduces a new class named {{OffsetsRequestManager}}, responsible 
> for:
>  * building ListOffsets requests
>  * processing its responses
> Consumer API functionality that requires ListOffsets requests is implemented 
> using this manager: beginningOffsets, endOffsets and offsetsForTimes.
> These consumer API functions will generate a ListOffsetsApplicationEvent with 
> parameters. This event is then handled by the OffsetsRequestManager, which 
> will build the ListOffsets request and process its responses, providing a 
> result back to the API via the ListOffsetsApplicationEvent completion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14875) Implement Wakeup()

2023-08-29 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-14875.
-
Fix Version/s: 3.7.0
   Resolution: Fixed

merged the PR to trunk

> Implement Wakeup()
> --
>
> Key: KAFKA-14875
> URL: https://issues.apache.org/jira/browse/KAFKA-14875
> Project: Kafka
>  Issue Type: Task
>  Components: clients, consumer
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Major
>  Labels: consumer-threading-refactor
> Fix For: 3.7.0
>
>
> Implement wakeup() and WakeupException. This would be different from the 
> current implementation because I think we just need to interrupt the blocking 
> futures.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14950) Implement assign() and assignment()

2023-07-21 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-14950.
-
Fix Version/s: 3.6.0
   Resolution: Fixed

merged the PR to trunk.

> Implement assign() and assignment()
> ---
>
> Key: KAFKA-14950
> URL: https://issues.apache.org/jira/browse/KAFKA-14950
> Project: Kafka
>  Issue Type: Task
>  Components: clients, consumer
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Major
>  Labels: kip-945
> Fix For: 3.6.0
>
>
> Implement assign() and assignment()



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14966) Extract reusable common logic from OffsetFetcher

2023-06-08 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-14966.
-
Fix Version/s: 3.6.0
   Resolution: Fixed

Merged the PR to trunk.

> Extract reusable common logic from OffsetFetcher
> 
>
> Key: KAFKA-14966
> URL: https://issues.apache.org/jira/browse/KAFKA-14966
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Reporter: Lianet Magrans
>Assignee: Lianet Magrans
>Priority: Major
> Fix For: 3.6.0
>
>
> The OffsetFetcher is internally used by the KafkaConsumer to fetch offsets, 
> validate and reset positions. 
> For the new consumer based on a refactored threading model, similar 
> functionality will be needed by the ListOffsetsRequestManager component. 
> This task aims at identifying and extracting the OffsetFetcher functionality 
> that can be reused by the new consumer implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15058) Improve the accuracy of Histogram in client metric

2023-06-05 Thread Jun Rao (Jira)
Jun Rao created KAFKA-15058:
---

 Summary: Improve the accuracy of Histogram in client metric
 Key: KAFKA-15058
 URL: https://issues.apache.org/jira/browse/KAFKA-15058
 Project: Kafka
  Issue Type: Improvement
  Components: clients
Reporter: Jun Rao


The Histogram type (org.apache.kafka.common.metrics.stats) in KafkaMetrics in 
the client module statically divides the value space into a fixed number of 
buckets and only returns values on bucket boundaries. So, the returned 
histogram value may be one that was never actually recorded. The Yammer 
Histogram, on the other hand, uses reservoir sampling. Its reported value is 
always one of the recorded values, and is likely more accurate. Because of 
this, the Histogram type in client metrics hasn't been used widely. It would be 
useful to improve the Histogram in the client metrics to be more accurate.
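
For contrast, a minimal sketch of uniform reservoir sampling (Vitter's 
Algorithm R), the general approach behind the Yammer histogram behavior 
described above; this is illustrative, not Kafka or Yammer code:

{code:java}
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: every recorded value has an equal chance of being in the sample,
// so reported quantiles always come from values that were actually recorded.
final class UniformReservoir {
    private final long[] samples;
    private long count;

    UniformReservoir(int size) {
        this.samples = new long[size];
    }

    synchronized void update(long value) {
        count++;
        if (count <= samples.length) {
            samples[(int) (count - 1)] = value;                      // fill the reservoir first
        } else {
            long idx = ThreadLocalRandom.current().nextLong(count);  // uniform in [0, count)
            if (idx < samples.length) {
                samples[(int) idx] = value;                          // replace with decreasing probability
            }
        }
    }

    synchronized long[] snapshot() {
        return Arrays.copyOf(samples, (int) Math.min(count, samples.length));
    }
}
{code}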



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14561) Improve transactions experience for older clients by ensuring ongoing transaction

2023-04-12 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-14561.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

merged the PR to trunk.

> Improve transactions experience for older clients by ensuring ongoing 
> transaction
> -
>
> Key: KAFKA-14561
> URL: https://issues.apache.org/jira/browse/KAFKA-14561
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
> Fix For: 3.5.0
>
>
> This is part 3 of KIP-890:
> 3. *To cover older clients, we will ensure a transaction is ongoing before we 
> write to a transaction. We can do this by querying the transaction 
> coordinator and caching the result.*
> See KIP-890 for more details: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14617) Replicas with stale broker epoch should not be allowed to join the ISR

2023-04-07 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-14617.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

merged all PRs to trunk.

> Replicas with stale broker epoch should not be allowed to join the ISR
> --
>
> Key: KAFKA-14617
> URL: https://issues.apache.org/jira/browse/KAFKA-14617
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14685) TierStateMachine interface for building remote aux log

2023-02-24 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-14685.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

merged the PR to trunk.

> TierStateMachine interface for building remote aux log
> --
>
> Key: KAFKA-14685
> URL: https://issues.apache.org/jira/browse/KAFKA-14685
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Matthew Wong
>Assignee: Matthew Wong
>Priority: Major
> Fix For: 3.5.0
>
>
> To help with https://issues.apache.org/jira/browse/KAFKA-13560 , we can 
> introduce an interface to manage state transitions of building the remote aux 
> log asynchronously



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-12736) KafkaProducer.flush holds onto completed ProducerBatch(s) until flush completes

2022-12-19 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12736.
-
Fix Version/s: 3.0.0
 Assignee: Lucas Bradstreet
   Resolution: Fixed

> KafkaProducer.flush holds onto completed ProducerBatch(s) until flush 
> completes
> ---
>
> Key: KAFKA-12736
> URL: https://issues.apache.org/jira/browse/KAFKA-12736
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Lucas Bradstreet
>Assignee: Lucas Bradstreet
>Priority: Minor
> Fix For: 3.0.0
>
>
> When flush is called, a copy of the incomplete batches is made. This means 
> that the full ProducerBatch(s) are held in memory until the flush has 
> completed. For batches where the existing memory pool is used this is not as 
> wasteful, as the memory will already have been returned to the pool, but 
> non-pool memory can only be GC'd after the flush has completed. Rather than 
> use copyAll, we can make a new array with only the produceFuture(s) and await 
> on those.
>  
> {code:java}
> /**
>  * Mark all partitions as ready to send and block until the send is complete
>  */
> public void awaitFlushCompletion() throws InterruptedException {
>  try {
>  for (ProducerBatch batch : this.incomplete.copyAll())
>  batch.produceFuture.await();
>  } finally {
>  this.flushesInProgress.decrementAndGet();
>  }
> }
> {code}
> This may help in cases where the application is already memory constrained 
> and memory usage is slowing progress on completion of the incomplete batches.
>  
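
A hedged sketch of the alternative described above: snapshot only the produce 
futures so the completed ProducerBatch objects can be garbage collected while 
flush() waits. The requestResults() helper is assumed here for illustration and 
may not match the merged change exactly.

{code:java}
/**
 * Mark all partitions as ready to send and block until the send is complete.
 * Sketch only: awaits the per-batch futures instead of holding the batches themselves.
 */
public void awaitFlushCompletion() throws InterruptedException {
    try {
        // requestResults() is assumed to return the ProduceRequestResult of each incomplete batch.
        for (ProduceRequestResult result : this.incomplete.requestResults())
            result.await();
    } finally {
        this.flushesInProgress.decrementAndGet();
    }
}
{code}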



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-13369) Follower fetch protocol enhancements for tiered storage.

2022-12-17 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13369.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

merged the PR to trunk.

> Follower fetch protocol enhancements for tiered storage.
> 
>
> Key: KAFKA-13369
> URL: https://issues.apache.org/jira/browse/KAFKA-13369
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Satish Duggana
>Assignee: Satish Duggana
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14303) Producer.send without record key and batch.size=0 goes into infinite loop

2022-11-28 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-14303.
-
Fix Version/s: 3.4.0
   3.3.2
   Resolution: Fixed

Merged the PR to 3.3 and trunk.

> Producer.send without record key and batch.size=0 goes into infinite loop
> -
>
> Key: KAFKA-14303
> URL: https://issues.apache.org/jira/browse/KAFKA-14303
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Igor Soarez
>Assignee: Igor Soarez
>Priority: Major
>  Labels: Partitioner, bug, client, producer, regresion
> Fix For: 3.4.0, 3.3.2
>
>
> 3.3 has broken previous producer behavior.
> A call to {{producer.send(record)}} with a record without a key and 
> configured with {{batch.size=0}} never returns.
> Reproducer:
> {code:java}
> class ProducerIssueTest extends IntegrationTestHarness {
>   override protected def brokerCount = 1
>   @Test
>   def testProduceWithBatchSizeZeroAndNoRecordKey(): Unit = {
> val topicName = "foo"
> createTopic(topicName)
> val overrides = new Properties
> overrides.put(ProducerConfig.BATCH_SIZE_CONFIG, 0)
> val producer = createProducer(keySerializer = new StringSerializer, 
> valueSerializer = new StringSerializer, overrides)
> val record = new ProducerRecord[String, String](topicName, null, "hello")
> val future = producer.send(record) // goes into infinite loop here
> future.get(10, TimeUnit.SECONDS)
>   }
> } {code}
>  
> [Documentation for producer 
> configuration|https://kafka.apache.org/documentation/#producerconfigs_batch.size]
>  states {{batch.size=0}} as a valid value:
> {quote}Valid Values: [0,...]
> {quote}
> and recommends its use directly:
> {quote}A small batch size will make batching less common and may reduce 
> throughput (a batch size of zero will disable batching entirely).
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14321) max.compaction.lag.ms is not enforced accurately

2022-10-18 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-14321.
-
Resolution: Duplicate

This actually duplicates KAFKA-10760. Closing this one.

> max.compaction.lag.ms is not enforced accurately
> 
>
> Key: KAFKA-14321
> URL: https://issues.apache.org/jira/browse/KAFKA-14321
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jun Rao
>Priority: Major
>
> Compaction only cleans data in non-active segments. When 
> max.compaction.lag.ms is set, we use it to set segment.ms to force segment 
> rolling by time. However, the current implementation of time-based segment 
> roll is not precise. It only rolls a segment if the new record's timestamp 
> differs from the timestamp of the first record in the segment by more than 
> segment.ms. If we have a bunch of records appended within segment.ms and then 
> stop producing new records, all those records could remain in the active 
> segment forever, which prevents the records from being cleaned.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14321) max.compaction.lag.ms is not enforced accurately

2022-10-18 Thread Jun Rao (Jira)
Jun Rao created KAFKA-14321:
---

 Summary: max.compaction.lag.ms is not enforced accurately
 Key: KAFKA-14321
 URL: https://issues.apache.org/jira/browse/KAFKA-14321
 Project: Kafka
  Issue Type: Bug
Reporter: Jun Rao


Compaction only cleans data in non-active segments. When max.compaction.lag.ms 
is set, we use it to set segment.ms to force segment rolling by time. However, 
the current implementation of time-based segment roll is not precise. It only 
rolls a segment if the new record's timestamp differs from the timestamp of the 
first record in the segment by more than segment.ms. If we have a bunch of 
records appended within segment.ms and then stop producing new records, all 
those records could remain in the active segment forever, which prevents the 
records from being cleaned.
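
A minimal sketch of the imprecise time-based roll check described above (method 
and parameter names are illustrative): the segment is only rolled when a new 
record arrives whose timestamp is more than segment.ms past the first record's 
timestamp, so an idle partition never rolls its active segment.

{code:java}
// Illustrative only: the roll decision is made on append, so without new appends
// the active segment is never rolled and its records are never compacted.
static boolean shouldRollOnAppend(long newRecordTimestampMs,
                                  long firstRecordTimestampMsInSegment,
                                  long segmentMs) {
    return newRecordTimestampMs - firstRecordTimestampMsInSegment > segmentMs;
}
{code}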



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-4545) tombstone needs to be removed after delete.retention.ms has passed after it has been cleaned

2022-10-18 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-4545.

Fix Version/s: 3.1.0
   Resolution: Fixed

This is fixed in KAFKA-8522.

> tombstone needs to be removed after delete.retention.ms has passed after it 
> has been cleaned
> 
>
> Key: KAFKA-4545
> URL: https://issues.apache.org/jira/browse/KAFKA-4545
> Project: Kafka
>  Issue Type: Improvement
>  Components: log
>Affects Versions: 0.10.0.0, 0.11.0.0, 1.0.0
>Reporter: Jun Rao
>Assignee: Richard Yu
>Priority: Minor
>  Labels: needs-kip
> Fix For: 3.1.0
>
>
> The algorithm for removing the tombstone in a compacted log is supposed to be 
> the following.
> 1. A tombstone is never removed while it's still in the dirty portion of the 
> log.
> 2. After the tombstone is in the cleaned portion of the log, we further delay 
> the removal of the tombstone by delete.retention.ms since the time the 
> tombstone entered the cleaned portion.
> Once the tombstone is in the cleaned portion, we know there can't be any 
> message with the same key before the tombstone. Therefore, for any consumer, 
> if it reads a non-tombstone message before the tombstone, but can read to the 
> end of the log within delete.retention.ms, it's guaranteed to see the 
> tombstone.
> However, the current implementation doesn't seem correct. We delay the 
> removal of the tombstone by delete.retention.ms since the last modified time 
> of the last cleaned segment. However, the last modified time is inherited 
> from the original segment, which could be arbitrarily old. So, the tombstone 
> may not be preserved as long as it needs to be.
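
A minimal sketch of the intended rule (names are illustrative): the removal 
clock should start when the tombstone enters the cleaned portion, not at the 
segment's inherited last-modified time.

{code:java}
// Illustrative only: a tombstone in the cleaned portion may be removed only after
// delete.retention.ms has elapsed since the time it entered the cleaned portion.
static boolean canRemoveTombstone(long enteredCleanedPortionMs,
                                  long deleteRetentionMs,
                                  long nowMs) {
    return nowMs - enteredCleanedPortionMs >= deleteRetentionMs;
}
{code}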



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14156) Built-in partitioner may create suboptimal batches with large linger.ms

2022-09-14 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-14156.
-
  Assignee: Artem Livshits
Resolution: Fixed

Merged the PR to 3.3 and trunk.

> Built-in partitioner may create suboptimal batches with large linger.ms
> ---
>
> Key: KAFKA-14156
> URL: https://issues.apache.org/jira/browse/KAFKA-14156
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 3.3.0
>Reporter: Artem Livshits
>Assignee: Artem Livshits
>Priority: Blocker
> Fix For: 3.3.0
>
>
> The new built-in "sticky" partitioner switches partitions based on the amount 
> of bytes produced to a partition.  It doesn't use batch creation as a switch 
> trigger.  The previous "sticky" DefaultPartitioner switched partition when a 
> new batch was created and with small linger.ms (default is 0) could result in 
> sending larger batches to slower brokers potentially overloading them.  See 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
>  for more detail.
> However, with a large linger.ms, the new built-in partitioner may create 
> suboptimal batches.  Let's consider an example, suppose linger.ms=500, 
> batch.size=16KB (default) and we produce 24KB / sec, i.e. every 500ms we 
> produce 12KB worth of data.  The new built-in partitioner would switch 
> partition on every 16KB, so we could get into the following batching pattern:
>  * produce 12KB to one partition in 500ms, hit linger, send 12KB batch
>  * produce 4KB more to the same partition, now we've produced 16KB of data, 
> switch partition
>  * produce 12KB to the second partition in 500ms, hit linger, send 12KB batch
>  * in the meantime, the 4KB produced to the first partition would hit linger 
> as well, sending 4KB batch
>  * produce 4KB more to the second partition, now we've produced 16KB of data 
> to the second partition, switch to 3rd partition
> so in this scenario the new built-in partitioner produces a mix of 12KB and 
> 4KB batches, while the previous DefaultPartitioner would produce only 12KB 
> batches -- it switches on new batch creation, so there is no "mid-linger" 
> leftover batches.
> To avoid batch fragmentation on partition switch, we can wait 
> until the batch is ready before switching the partition, i.e. the condition 
> to switch to a new partition would be "produced batch.size bytes" AND "batch 
> is not lingering".  This may potentially introduce some non-uniformity into 
> data distribution, but unlike the previous DefaultPartitioner, the 
> non-uniformity would not be based on broker performance and won't 
> re-introduce the bad pattern of sending more data to slower brokers.
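
A minimal sketch of the two switch rules discussed above (hypothetical names, not the actual producer source):

{code:java}
// Hypothetical helper illustrating the partition-switch conditions described
// above; it is not the actual built-in partitioner code.
final class StickySwitchRule {
    // Current built-in rule: switch purely on bytes produced to the partition.
    static boolean switchOnBytesOnly(int bytesProduced, int batchSize) {
        return bytesProduced >= batchSize;
    }

    // Proposed rule: also require that the current batch is not still lingering,
    // so a switch never strands a small "mid-linger" leftover batch.
    static boolean switchWhenBatchReady(int bytesProduced, int batchSize, boolean batchLingering) {
        return bytesProduced >= batchSize && !batchLingering;
    }
}
{code}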



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14020) Performance regression in Producer

2022-07-20 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-14020.
-
Resolution: Fixed

merged the PR to 3.3.

> Performance regression in Producer
> --
>
> Key: KAFKA-14020
> URL: https://issues.apache.org/jira/browse/KAFKA-14020
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 3.3.0
>Reporter: John Roesler
>Assignee: Artem Livshits
>Priority: Blocker
> Fix For: 3.3.0
>
>
> [https://github.com/apache/kafka/commit/f7db6031b84a136ad0e257df722b20faa7c37b8a]
>  introduced a 10% performance regression in the KafkaProducer under a default 
> config.
>  
> The context for this result is a benchmark that we run for Kafka Streams. The 
> benchmark provisions 5 independent AWS clusters, including one broker node on 
> an i3.large and one client node on an i3.large. During a benchmark run, we 
> first run the Producer for 10 minutes to generate test data, and then we run 
> Kafka Streams under a number of configurations to measure its performance.
> Our observation was a 10% regression in throughput under the simplest 
> configuration, in which Streams simply consumes from a topic and does nothing 
> else. That benchmark actually runs faster than the producer that generates 
> the test data, so its throughput is bounded by the data generator's 
> throughput. After investigation, we realized that the regression was in the 
> data generator, not the consumer or Streams.
> We have numerous benchmark runs leading up to the commit in question, and 
> they all show a throughput in the neighborhood of 115,000 records per second. 
> We also have 40 runs including and after that commit, and they all show a 
> throughput in the neighborhood of 105,000 records per second. A test on 
> [trunk with the commit reverted |https://github.com/apache/kafka/pull/12342] 
> shows a return to around 115,000 records per second.
> Config:
> {code:java}
> final Properties properties = new Properties();
> properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, broker);
> properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
> properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
> {code}
> Here's the producer code in the data generator. Our tests were running with 
> three produceThreads.
> {code:java}
> for (int t = 0; t < produceThreads; t++) {
>     futures.add(executorService.submit(() -> {
>         int threadTotal = 0;
>         long lastPrint = start;
>         final long printInterval = Duration.ofSeconds(10).toMillis();
>         long now;
>         try (final org.apache.kafka.clients.producer.Producer<String, String> producer =
>                  new KafkaProducer<>(producerConfig(broker))) {
>             while (limit > (now = System.currentTimeMillis()) - start) {
>                 for (int i = 0; i < 1000; i++) {
>                     final String key = keys.next();
>                     final String data = dataGen.generate();
>                     producer.send(new ProducerRecord<>(topic, key, valueBuilder.apply(key, data)));
>                     threadTotal++;
>                 }
>                 if ((now - lastPrint) > printInterval) {
>                     System.out.println(Thread.currentThread().getName() + " produced "
>                         + numberFormat.format(threadTotal) + " to " + topic + " in "
>                         + Duration.ofMillis(now - start));
>                     lastPrint = now;
>                 }
>             }
>         }
>         total.addAndGet(threadTotal);
>         System.out.println(Thread.currentThread().getName() + " finished ("
>             + numberFormat.format(threadTotal) + ") in " + Duration.ofMillis(now - start));
>     }));
> }{code}
> As you can see, this is a very basic usage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-13803) Refactor Leader API Access

2022-06-03 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13803.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

merged the PR to trunk

> Refactor Leader API Access
> --
>
> Key: KAFKA-13803
> URL: https://issues.apache.org/jira/browse/KAFKA-13803
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Rittika Adhikari
>Assignee: Rittika Adhikari
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, AbstractFetcherThread has a series of protected APIs which control 
> access to the Leader. ReplicaFetcherThread and ReplicaAlterLogDirsThread 
> respectively override these protected APIs and handle access to the Leader in 
> a remote broker leader and a local leader context.
> We propose to move these protected APIs to a LeaderEndPoint interface, which 
> will serve all fetches from the Leader. We will implement a 
> RemoteLeaderEndPoint and a LocalLeaderEndPoint accordingly. This change will 
> greatly simplify our existing follower fetch code.
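
As a rough illustration of the proposal (the actual interface lives in the merged PR; every name and signature below is an assumption):

{code:java}
// Illustrative sketch only: names and signatures are assumptions, not the
// interface merged in the PR.
interface LeaderEndPoint {
    // Fetch records from the leader for a partition starting at fetchOffset.
    byte[] fetchRecords(String topic, int partition, long fetchOffset);

    // Offset lookups that AbstractFetcherThread needs from the leader.
    long fetchEarliestOffset(String topic, int partition, int currentLeaderEpoch);
    long fetchLatestOffset(String topic, int partition, int currentLeaderEpoch);
}
// A RemoteLeaderEndPoint would implement these with requests to the remote
// leader broker (for ReplicaFetcherThread), while a LocalLeaderEndPoint would
// serve them from the local replica (for ReplicaAlterLogDirsThread).
{code}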



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (KAFKA-10888) Sticky partition leads to uneven product msg, resulting in abnormal delays in some partitions

2022-05-06 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-10888.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

merged the PR to trunk. Thanks [~alivshits] for the design, implementation and 
the testing.

>  Sticky partition leads to uneven product msg, resulting in abnormal delays 
> in some partitions
> --
>
> Key: KAFKA-10888
> URL: https://issues.apache.org/jira/browse/KAFKA-10888
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 2.4.1
>Reporter: jr
>Assignee: Artem Livshits
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2020-12-24-21-05-02-800.png, 
> image-2020-12-24-21-09-47-692.png, image-2020-12-24-21-10-24-407.png
>
>
>   Setup: 110 producers, 550 partitions, 550 consumers, and a 5-node Kafka cluster.
>   The producers use the sticky partitioner with null keys; the total production 
> rate is about 1,000,000 (100w) tps.
> The observed partition delay is abnormal and the message distribution is 
> uneven, which makes the production and consumption delay of the partitions 
> that receive more messages abnormally high.
>   I cannot find a reason why the sticky partitioner would make the message 
> distribution uneven at this production rate.
>   I can't switch to the round-robin partitioner, which would increase the 
> delay and CPU cost. Does the sticky partitioner's design cause uneven message 
> distribution, or is this abnormal? How can it be solved?
>   !image-2020-12-24-21-09-47-692.png!
> As shown in the picture, the uneven distribution is concentrated on some 
> partitions and some brokers; there seems to be a pattern to it.
> This problem does not occur in only one cluster but in many high-tps clusters, 
> and it is more obvious on the test cluster we built.
> !image-2020-12-24-21-10-24-407.png!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (KAFKA-13815) Avoid reinitialization for a replica that is being deleted

2022-05-04 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13815.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

merged the PR to trunk

> Avoid reinitialization for a replica that is being deleted
> --
>
> Key: KAFKA-13815
> URL: https://issues.apache.org/jira/browse/KAFKA-13815
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Lucas Wang
>Assignee: Lucas Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> https://issues.apache.org/jira/browse/KAFKA-10002
> identified that deletion of replicas can be slow when a StopReplica request 
> is being
> processed, and has implemented a change to improve the efficiency.
> We found that the efficiency can be further improved by avoiding the 
> reinitialization of the
> leader epoch cache and partition metadata for a replica that needs to be 
> deleted.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (KAFKA-13448) Kafka Producer Client Callback behaviour does not align with Javadoc

2022-04-26 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13448.
-
Fix Version/s: 3.2.0
 Assignee: Philip Nee
   Resolution: Fixed

merged [https://github.com/apache/kafka/pull/11689] and a followup fix 
[https://github.com/apache/kafka/pull/12064] to trunk and 3.2 branch.

> Kafka Producer Client Callback behaviour does not align with Javadoc
> 
>
> Key: KAFKA-13448
> URL: https://issues.apache.org/jira/browse/KAFKA-13448
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 1.1.0
>Reporter: Seamus O Ceanainn
>Assignee: Philip Nee
>Priority: Minor
> Fix For: 3.2.0
>
>
> In PR [#4188|https://github.com/apache/kafka/pull/4188] as part of 
> KAFKA-6180, a breaking change was accidentally introduced in the behaviour of 
> Callbacks for the producer client.
> Previously, whenever an exception was thrown when producing an event, the 
> value for 'metadata' passed to the Callback.onCompletion method was always 
> null. In PR [#4188|https://github.com/apache/kafka/pull/4188], in one part of 
> the code where Callback.onCompletion is called, the behaviour was changed so 
> that instead of passing a null value for metadata, a 'placeholder' value was 
> provided instead (see 
> [here|https://github.com/apache/kafka/pull/4188/files#diff-42d8f5166459ee28f201ff9cec0080fc7845544a0089ac9e8f3e16864cc1193eR1196]
>  and 
> [here|https://github.com/apache/kafka/pull/4188/files#diff-42d8f5166459ee28f201ff9cec0080fc7845544a0089ac9e8f3e16864cc1193eR1199]).
>   This placeholder contained only topic and partition information, with all 
> other fields set to '-1'.
> This change only impacted a subset of exceptions, so that in the case of 
> ApiExceptions metadata would still be null (see 
> [here|https://github.com/apache/kafka/commit/aa42a11dfd99ee9ab24d2e9a7521ef1c97ae1ff4#diff-42d8f5166459ee28f201ff9cec0080fc7845544a0089ac9e8f3e16864cc1193eR843]),
>  but for all other exceptions the placeholder value would be used. The 
> behaviour at the time of writing remains the same.
> This issue was first reported in KAFKA-7412 when a user first noticed the 
> difference between the documented behaviour of Callback.onCompletion and the 
> implemented behaviour.
> At the time it was assumed that the behaviour when errors occur was to always 
> provide a placeholder metadata value to Callback.onCompletion, and the 
> documentation was updated at that time to reflect this assumption in [PR 
> #5798|https://github.com/apache/kafka/pull/5798]. The documentation now 
> states that when an exception occurs that the method will be called with an 
> empty metadata value (see 
> [here|https://github.com/apache/kafka/blob/3.1/clients/src/main/java/org/apache/kafka/clients/producer/Callback.java#L30-L31]).
>  However, there is still one case where Callback.onCompletion is called with 
> a null value for metadata (see 
> [here|https://github.com/apache/kafka/blob/3.1/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L1002]),
>  so there is still a discrepancy between the documented behaviour and the 
> implementation of Callback.onCompletion.
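
Given the behaviour described above, a defensive callback should treat the metadata as untrustworthy whenever an exception is passed. A small sketch (not from the original report):

{code:java}
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.RecordMetadata;

// Defensive callback sketch: on error, metadata may be null or a placeholder
// whose numeric fields are -1, so it is only used here for best-effort logging.
public class SafeCallback implements Callback {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            String where = (metadata != null)
                ? metadata.topic() + "-" + metadata.partition()
                : "unknown";
            System.err.println("Send failed for " + where + ": " + exception);
            return;
        }
        System.out.println("Sent to " + metadata.topic() + "-" + metadata.partition()
            + " at offset " + metadata.offset());
    }
}
{code}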



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (KAFKA-13687) Limit number of batches when using kafka-dump-log.sh

2022-04-06 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13687.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

merged the PR to trunk.

> Limit number of batches when using kafka-dump-log.sh
> 
>
> Key: KAFKA-13687
> URL: https://issues.apache.org/jira/browse/KAFKA-13687
> Project: Kafka
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 2.8.1
>Reporter: Sergio Troiano
>Assignee: Sergio Troiano
>Priority: Minor
>  Labels: easyfix, features
> Fix For: 3.3.0
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Currently kafka-dump-log.sh reads the whole file(s) and dumps the contents of 
> the segment file(s).
> As we know, the savings from combining compression and batching while 
> producing (if the payloads are good candidates) are huge.
>  
> We would like to have a way to "monitor" how the producers build their batches, 
> as we do not always have access to the producer metrics.
> We have multitenant producers, so it is hard to "detect" when the usage is not 
> optimal.
>  
> The problem with the way the dump tool currently works is that it reads the 
> whole file. In a scenario with thousands of topics with different segment 
> sizes (the default is 1 GB), we could end up affecting the cluster balance, as 
> we evict useful pages from the page cache and replace them with what we read 
> from the files.
>  
> Since we only need to take a few samples from the segments to see the usage 
> pattern while producing, we would like to add a new parameter called 
> maxBatches.
>  
> Based on the current script, the change is quite small: it only needs a 
> parameter and a counter.
>  
> After adding this change we could, for example, periodically take smaller 
> samples and analyze the batch headers (looking at compresscodec and the batch 
> count).
>  
> Doing this, we could automate a tool that reads all the topics. Going further, 
> when we see that a client is using neither compression nor batching, we could 
> take the payloads of those samples and simulate compressing them (with 
> batching and compression); with those numbers we can approach the client with 
> a cost-saving proposal.
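
A standalone sketch of the "parameter plus counter" idea; this is not the actual tool change, and the file path and limit are made-up inputs:

{code:java}
import java.io.File;
import java.io.IOException;
import org.apache.kafka.common.record.FileRecords;
import org.apache.kafka.common.record.RecordBatch;

// Sketch: print header info for at most maxBatches batches of a segment file
// instead of scanning the whole file.
public class SampleBatches {
    public static void main(String[] args) throws IOException {
        File segment = new File(args[0]);           // e.g. a .log segment file
        int maxBatches = Integer.parseInt(args[1]); // the proposed limit
        FileRecords records = FileRecords.open(segment);
        try {
            int seen = 0;
            for (RecordBatch batch : records.batches()) {
                System.out.println("baseOffset=" + batch.baseOffset()
                    + " count=" + batch.countOrNull()
                    + " compresscodec=" + batch.compressionType());
                if (++seen >= maxBatches) {
                    break;                          // the counter doing its job
                }
            }
        } finally {
            records.close();
        }
    }
}
{code}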



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (KAFKA-12875) Change Log layer segment map mutations to avoid absence of active segment

2022-03-31 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12875.
-
Fix Version/s: 3.3.0
 Assignee: Yu Yang
   Resolution: Fixed

Merged to trunk.

> Change Log layer segment map mutations to avoid absence of active segment
> -
>
> Key: KAFKA-12875
> URL: https://issues.apache.org/jira/browse/KAFKA-12875
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Kowshik Prakasam
>Assignee: Yu Yang
>Priority: Major
> Fix For: 3.3.0
>
>
> [https://github.com/apache/kafka/pull/10650] showed a case where active 
> segment was absent when Log layer segments were iterated. We should 
> investigate Log layer code to see if we can change Log layer segment map 
> mutations to avoid absence of active segment at any given point. For example, 
> if we are clearing all segments and creating a new one, maybe we can reverse 
> the order to create a new segment first and then clear the old ones later.
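
A toy sketch of the suggested reordering, using a plain sorted map as a stand-in for the Log layer's segment map (illustrative only, not the Log code):

{code:java}
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative only: a sorted map standing in for the segment map. Adding the
// new segment before clearing the old ones means concurrent readers never see
// a moment with no active segment.
public class SegmentSwap {
    private final ConcurrentSkipListMap<Long, String> segments = new ConcurrentSkipListMap<>();

    void replaceAllWith(long newBaseOffset) {
        segments.put(newBaseOffset, "segment-" + newBaseOffset); // 1. create/add the new segment first
        segments.headMap(newBaseOffset).clear();                 // 2. then clear the old ones
    }
}
{code}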



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (KAFKA-13723) max.compaction.lag.ms implemented incorrectly

2022-03-09 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13723.
-
Resolution: Not A Problem

> max.compaction.lag.ms implemented incorrectly
> -
>
> Key: KAFKA-13723
> URL: https://issues.apache.org/jira/browse/KAFKA-13723
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.3.0
>Reporter: Jun Rao
>Priority: Major
>
> In https://issues.apache.org/jira/browse/KAFKA-7321, we introduced 
> max.compaction.lag.ms to guarantee that a record be cleaned before a certain 
> time. 
>  
> The implementation in LogCleanerManager has the following code. The path for 
> earliestDirtySegmentTimestamp < cleanUntilTime seems incorrect. In that case, 
> it seems that we should set the delay to 0 so that we could trigger cleaning 
> immediately since the segment has been dirty for longer than 
> max.compaction.lag.ms. 
>  
>  
> {code:java}
> def maxCompactionDelay(log: UnifiedLog, firstDirtyOffset: Long, now: Long): Long = {
>   ...
>   val maxCompactionLagMs = math.max(log.config.maxCompactionLagMs, 0L)
>   val cleanUntilTime = now - maxCompactionLagMs
>   if (earliestDirtySegmentTimestamp < cleanUntilTime)
>     cleanUntilTime - earliestDirtySegmentTimestamp
>   else
>     0L
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (KAFKA-13723) max.compaction.lag.ms implemented incorrectly

2022-03-09 Thread Jun Rao (Jira)
Jun Rao created KAFKA-13723:
---

 Summary: max.compaction.lag.ms implemented incorrectly
 Key: KAFKA-13723
 URL: https://issues.apache.org/jira/browse/KAFKA-13723
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.3.0
Reporter: Jun Rao


In https://issues.apache.org/jira/browse/KAFKA-7321, we introduced 
max.compaction.lag.ms to guarantee that a record be cleaned before a certain 
time. 

 

The implementation in LogCleanerManager has the following code. The path for 
earliestDirtySegmentTimestamp < cleanUntilTime seems incorrect. In that case, 
it seems that we should set the delay to 0 so that we could trigger cleaning 
immediately since the segment has been dirty for longer than 
max.compaction.lag.ms. 

 

 
{code:java}
def maxCompactionDelay(log: UnifiedLog, firstDirtyOffset: Long, now: Long): Long = {
  ...
  val maxCompactionLagMs = math.max(log.config.maxCompactionLagMs, 0L)
  val cleanUntilTime = now - maxCompactionLagMs

  if (earliestDirtySegmentTimestamp < cleanUntilTime)
    cleanUntilTime - earliestDirtySegmentTimestamp
  else
    0L
}{code}
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (KAFKA-13407) Kafka controller out of service after ZK leader restart

2022-02-24 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13407.
-
Resolution: Fixed

> Kafka controller out of service after ZK leader restart
> ---
>
> Key: KAFKA-13407
> URL: https://issues.apache.org/jira/browse/KAFKA-13407
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.8.0, 2.8.1
> Environment: Ubuntu 20.04
>Reporter: Daniel
>Priority: Critical
>
> When the ZooKeeper leader disappears and a new instance becomes the leader, 
> the Kafka instances need to reconnect to ZooKeeper, but the Kafka "Controller" 
> gets stuck in a limbo state after re-establishing the connection.
> See below for how I manage to reproduce this over and over.
> *Prerequisites*
> Have a Kafka cluster with 3 instances running version 2.8.1. Figure out which 
> one is the Controller. I'm using Kafkacat 1.5.0 and get this info using the 
> `-L` flag.
> Zookeeper runs with 3 instances on version 3.5.9. Figure out which one is 
> leader by checking
>  
> {code:java}
> echo stat | nc -v localhost 2181
> {code}
>  
>  
> *Reproduce*
> 1. Stop the leader Zookeeper service.
> 2. Watch the logs of the Kafka Controller and ensure that it reconnects and 
> registers again.
>  
> {code:java}
> Oct 27 09:13:08 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:08,882] INFO 
> Unable to read additional data from server sessionid 0x1f2a12870003, likely 
> server has closed socket, closing socket connection and attempting reconnect 
> (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] WARN 
> SASL configuration failed: javax.security.auth.login.LoginException: No JAAS 
> configuration section named 'Client' was found in specified JAAS 
> configuration file: '/opt/kafka/config/kafka_server_jaas.conf'. Will continue 
> connection to Zookeeper server without SASL authentication, if Zookeeper 
> server allows it. (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] INFO 
> Opening socket connection to server 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 
> (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] ERROR 
> [ZooKeeperClient Kafka server] Auth failed. (kafka.zookeeper.ZooKeeperClient)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,549] INFO 
> Socket connection established, initiating session, client: 
> /10.10.85.215:39338, server: 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 
> (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,569] INFO 
> Session establishment complete on server 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181, 
> sessionid = 0x1f2a12870003, negotiated timeout = 18000 
> (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,548] INFO 
> [ZooKeeperClient Kafka server] Reinitializing due to auth failure. 
> (kafka.zookeeper.ZooKeeperClient)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO 
> [PartitionStateMachine controllerId=1003] Stopped partition state machine 
> (kafka.controller.ZkPartitionStateMachine)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO 
> [ReplicaStateMachine controllerId=1003] Stopped replica state machine 
> (kafka.controller.ZkReplicaStateMachine)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO 
> [RequestSendThread controllerId=1003] Shutting down 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO 
> [RequestSendThread controllerId=1003] Stopped 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO 
> [RequestSendThread controllerId=1003] Shutdown completed 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO 
> [RequestSendThread controllerId=1003] Shutting down 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO 
> [RequestSendThread controllerId=1003] Stopped 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO 
> [RequestSendThread controllerId=1003] Shutdown completed 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO 
> [RequestSendThread controllerId=1003] Shutting down 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INF

[jira] [Resolved] (KAFKA-13603) empty active segment can trigger recovery after clean shutdown and restart

2022-01-27 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13603.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

merged the PR to trunk.

> empty active segment can trigger recovery after clean shutdown and restart
> --
>
> Key: KAFKA-13603
> URL: https://issues.apache.org/jira/browse/KAFKA-13603
> Project: Kafka
>  Issue Type: Bug
>Reporter: Cong Ding
>Assignee: Cong Ding
>Priority: Minor
> Fix For: 3.2.0
>
>
> Within a LogSegment, the TimeIndex and OffsetIndex are lazy indices that 
> don't get created on disk until they are accessed for the first time. If the 
> active segment is empty at the time of the clean shutdown, the disk will have 
> only the log file but no index files.
> However, Log recovery logic expects the presence of an offset index file on 
> disk for each segment, otherwise, the segment is considered corrupted.
> We need to address this issue: create the index files for empty active 
> segments during clean shutdown.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (KAFKA-13544) Deadlock during shutting down kafka broker because of connectivity problem with zookeeper

2021-12-17 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13544.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Merged the PR to trunk.

> Deadlock during shutting down kafka broker because of connectivity problem 
> with zookeeper 
> --
>
> Key: KAFKA-13544
> URL: https://issues.apache.org/jira/browse/KAFKA-13544
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.8.1
>Reporter: Andrei Lakhmanets
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: kafka_broker_logs.log, kafka_broker_stackdump.txt
>
>
> Hi team,
> *Kafka version:* 2.8.1
> *Configuration:* 3 kafka brokers in different availability zones and 3 
> zookeeper brokers in different availability zones.
> I ran into a deadlock in Kafka. I've attached a stack dump of the Kafka state 
> to this ticket. The locked threads are "feature-zk-node-event-process-thread" 
> and "kafka-shutdown-hook".
> *Context:*
> My Kafka cluster had connectivity problems with ZooKeeper, and in the logs I 
> saw the following exception:
> The stacktrace:
> {code:java}
> [2021-12-06 18:31:14,629] WARN Unable to reconnect to ZooKeeper service, 
> session 0x1039563000f has expired (org.apache.zookeeper.ClientCnxn)
> [2021-12-06 18:31:14,629] INFO Unable to reconnect to ZooKeeper service, 
> session 0x1039563000f has expired, closing socket connection 
> (org.apache.zookeeper.ClientCnxn)
> [2021-12-06 18:31:14,629] INFO EventThread shut down for session: 
> 0x1039563000f (org.apache.zookeeper.ClientCnxn)
> [2021-12-06 18:31:14,631] INFO [ZooKeeperClient Kafka server] Session 
> expired. (kafka.zookeeper.ZooKeeperClient)
> [2021-12-06 18:31:14,632] ERROR [feature-zk-node-event-process-thread]: 
> Failed to process feature ZK node change event. The broker will eventually 
> exit. 
> (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread)
> kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either 
> before or while waiting for connection
>     at 
> kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:279)
>     at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$1(ZooKeeperClient.scala:261)
>     at 
> kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:261)
>     at 
> kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1797)
>     at 
> kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1767)
>     at 
> kafka.zk.KafkaZkClient.retryRequestUntilConnected(KafkaZkClient.scala:1762)
>     at kafka.zk.KafkaZkClient.getDataAndStat(KafkaZkClient.scala:771)
>     at kafka.zk.KafkaZkClient.getDataAndVersion(KafkaZkClient.scala:755)
>     at 
> kafka.server.FinalizedFeatureChangeListener$FeatureCacheUpdater.updateLatestOrThrow(FinalizedFeatureChangeListener.scala:74)
>     at 
> kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread.doWork(FinalizedFeatureChangeListener.scala:147)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96) {code}
> The exception is thrown in the feature-zk-node-event-process-thread thread and 
> is caught in 
> FinalizedFeatureChangeListener.ChangeNotificationProcessorThread.doWork, which 
> then throws FatalExitError(1).
> The FatalExitError is caught in ShutdownableThread.run, which calls 
> Exit.exit(e.statusCode()) and, under the hood, System.exit.
> The stackdump of "feature-zk-node-event-process-thread" thread:
> {code:java}
> "feature-zk-node-event-process-thread" #23 prio=5 os_prio=0 cpu=163.19ms 
> elapsed=1563046.32s tid=0x7fd0dcdec800 nid=0x2088 in Object.wait()  
> [0x7fd07e2c1000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(java.base@11.0.11/Native Method)
>     - waiting on 
>     at java.lang.Thread.join(java.base@11.0.11/Thread.java:1300)
>     - waiting to re-lock in wait() <0x88b9d3c8> (a 
> org.apache.kafka.common.utils.KafkaThread)
>     at java.lang.Thread.join(java.base@11.0.11/Thread.java:1375)
>     at 
> java.lang.ApplicationShutdownHooks.runHooks(java.base@11.0.11/ApplicationShutdownHooks.java:107)
>     at 
> java.lang.ApplicationShutdownHooks$1.run(java.base@11.0.11/ApplicationShutdownHooks.java:46)
>     at java.lang.Shutdown.runHooks(java.base@11.0.11/Shutdown.java:130)
>     at java.lang.Shutdown.exit(java.base@11.0.11/Shutdown.java:174)
>     - locked <0x806872f8> (a java.lang.Class for java.lang.Shutdown)
>     at java.lang.Runtime.exit(java.base@11.0.11/Runtime.java:116)
>     at java.lang.System.exit(java.base@11.0.11/System.java:1752)
>     at org.apache.kafka.common.utils.Exit$2.execute(Exit.java:43)
>     at org.

[jira] [Resolved] (KAFKA-13551) kafka-log4j-appender-2.1.1.jar Is cve-2021-44228 involved?

2021-12-16 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13551.
-
Resolution: Information Provided

The impact is described in [https://kafka.apache.org/cve-list]

>  kafka-log4j-appender-2.1.1.jar Is cve-2021-44228 involved? 
> 
>
> Key: KAFKA-13551
> URL: https://issues.apache.org/jira/browse/KAFKA-13551
> Project: Kafka
>  Issue Type: Task
>  Components: consumer
>Reporter: xiansheng fu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (KAFKA-13202) KIP-768: Extend SASL/OAUTHBEARER with Support for OIDC

2021-10-28 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13202.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

merged the PR to trunk.

> KIP-768: Extend SASL/OAUTHBEARER with Support for OIDC
> --
>
> Key: KAFKA-13202
> URL: https://issues.apache.org/jira/browse/KAFKA-13202
> Project: Kafka
>  Issue Type: New Feature
>  Components: clients, security
>Reporter: Kirk True
>Assignee: Kirk True
>Priority: Major
> Fix For: 3.1.0
>
>
> This task is to provide a concrete implementation of the interfaces defined 
> in 
> [KIP-255|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75968876]
>  to allow Kafka to connect to an [OAuth|https://en.wikipedia.org/wiki/OAuth] 
> / [OIDC|https://en.wikipedia.org/wiki/OpenID#OpenID_Connect_(OIDC)] identity 
> provider for authentication and token retrieval. While KIP-255 provides an 
> unsecured JWT example for development, this will fill in the gap and provide 
> a production-grade implementation.
> The OAuth/OIDC work will allow out-of-the-box configuration by any Apache 
> Kafka users to connect to an external identity provider service (e.g. Okta, 
> Auth0, Azure, etc.). The code will implement the standard OAuth 
> {{clientcredentials}} grant type.
> The proposed change is largely composed of a pair of 
> {{AuthenticateCallbackHandler}} implementations: one to login on the client 
> and one to validate on the broker.
> See [KIP-768: Extend SASL/OAUTHBEARER with Support for 
> OIDC|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186877575]
>  for more detail.
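
For orientation, a hedged sketch of the kind of client configuration this enables. The endpoint URL, client id, and secret are placeholders, and the exact callback handler class name should be taken from KIP-768 and the release docs:

{code:java}
import java.util.Properties;

// Placeholder values throughout; only the general shape is meant to be accurate.
public class OidcClientConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "OAUTHBEARER");
        // Token endpoint of the OIDC provider (placeholder URL).
        props.put("sasl.oauthbearer.token.endpoint.url", "https://issuer.example.com/oauth2/token");
        // client_credentials grant: id/secret passed as JAAS options (placeholders).
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required"
                + " clientId=\"my-client\" clientSecret=\"my-secret\";");
        // sasl.login.callback.handler.class must also point at the OIDC login
        // callback handler shipped with this feature (see KIP-768 for the class name).
        return props;
    }
}
{code}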



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12802) Add a file based cache for consumed remote log metadata for each partition to avoid consuming again incase of broker restarts.

2021-10-11 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12802.
-
Resolution: Fixed

merged the PR to trunk

> Add a file based cache for consumed remote log metadata for each partition to 
> avoid consuming again incase of broker restarts.
> --
>
> Key: KAFKA-12802
> URL: https://issues.apache.org/jira/browse/KAFKA-12802
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Satish Duggana
>Assignee: Satish Duggana
>Priority: Major
> Fix For: 3.1.0
>
>
> Add a file based cache for consumed remote log metadata for each partition to 
> avoid consuming again in case of broker restarts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13305) NullPointerException in LogCleanerManager "uncleanable-bytes" gauge

2021-09-27 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13305.
-
Fix Version/s: 3.1.0
 Assignee: Vincent Jiang
   Resolution: Fixed

merged the PR to trunk.

> NullPointerException in LogCleanerManager "uncleanable-bytes" gauge
> ---
>
> Key: KAFKA-13305
> URL: https://issues.apache.org/jira/browse/KAFKA-13305
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Reporter: Vincent Jiang
>Assignee: Vincent Jiang
>Priority: Major
> Fix For: 3.1.0
>
>
> We've seen following exception in production environment:
> {quote} java.lang.NullPointerException: Cannot invoke 
> "kafka.log.UnifiedLog.logStartOffset()" because "log" is null at
> kafka.log.LogCleanerManager$.cleanableOffsets(LogCleanerManager.scala:599)
> {quote}
> Looks like uncleanablePartitions never has partitions removed from it to 
> reflect partition deletion/reassignment.
>  
> We should fix the NullPointerException and remove deleted partitions from 
> uncleanablePartitions.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13070) LogManager shutdown races with periodic work scheduled by the instance

2021-09-23 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13070.
-
Resolution: Duplicate

> LogManager shutdown races with periodic work scheduled by the instance
> --
>
> Key: KAFKA-13070
> URL: https://issues.apache.org/jira/browse/KAFKA-13070
> Project: Kafka
>  Issue Type: Bug
>Reporter: Kowshik Prakasam
>Assignee: Cong Ding
>Priority: Major
>
> In the LogManager shutdown sequence (in LogManager.shutdown()), we don't 
> cancel the periodic work scheduled by it prior to shutdown. As a result, the 
> periodic work could race with the shutdown sequence causing some unwanted 
> side effects. This is reproducible by a unit test in LogManagerTest.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13315) log layer exception during shutdown that caused an unclean shutdown

2021-09-23 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13315.
-
Fix Version/s: 3.0.1
   3.1.0
   Resolution: Fixed

merged the PR to trunk and 3.0.

> log layer exception during shutdown that caused an unclean shutdown
> ---
>
> Key: KAFKA-13315
> URL: https://issues.apache.org/jira/browse/KAFKA-13315
> Project: Kafka
>  Issue Type: Bug
>Reporter: Cong Ding
>Assignee: Cong Ding
>Priority: Major
> Fix For: 3.1.0, 3.0.1
>
>
> We have seen an exception caused by shutting down the scheduler before shutting 
> down the LogManager.
> While the LogManager was closing partitions one by one, the scheduler kicked in 
> to delete old segments due to retention. However, those old segments could 
> already have been closed by the LogManager, which subsequently marked the logdir 
> as offline and didn't write the clean shutdown marker. Ultimately the broker 
> would take hours to restart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13313) In KRaft mode, CreateTopic should return the topic configs in the response

2021-09-20 Thread Jun Rao (Jira)
Jun Rao created KAFKA-13313:
---

 Summary: In KRaft mode, CreateTopic should return the topic 
configs in the response
 Key: KAFKA-13313
 URL: https://issues.apache.org/jira/browse/KAFKA-13313
 Project: Kafka
  Issue Type: Bug
  Components: kraft
Affects Versions: 3.0.0
Reporter: Jun Rao


ReplicationControlManager.createTopic() doesn't seem to populate the configs in 
CreatableTopicResult. ZkAdminManager.createTopics() does that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-8522) Tombstones can survive forever

2021-09-16 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-8522.

Fix Version/s: 3.1.0
 Assignee: Richard Yu
   Resolution: Fixed

Finally, merged the PR to trunk. Thanks for the work, [~Yohan123].

> Tombstones can survive forever
> --
>
> Key: KAFKA-8522
> URL: https://issues.apache.org/jira/browse/KAFKA-8522
> Project: Kafka
>  Issue Type: Improvement
>  Components: log cleaner
>Reporter: Evelyn Bayes
>Assignee: Richard Yu
>Priority: Minor
> Fix For: 3.1.0
>
>
> This is a bit grey zone as to whether it's a "bug" but it is certainly 
> unintended behaviour.
>  
> Under specific conditions tombstones effectively survive forever:
>  * Small amount of throughput;
>  * min.cleanable.dirty.ratio near or at 0; and
>  * Other parameters at default.
> What happens is that all the data continuously gets cycled into the oldest 
> segment. Old records get compacted away, but the new records continuously 
> update the timestamp of the oldest segment, resetting the countdown for 
> deleting tombstones.
> So tombstones build up in the oldest segment forever.
>  
> While you could "fix" this by reducing the segment size, this can be 
> undesirable as a sudden change in throughput could cause a dangerous number 
> of segments to be created.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12988) Change RLMM add/updateRemoteLogSegmentMetadata and putRemotePartitionDeleteMetadata APIS asynchronous.

2021-09-09 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12988.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

merged the PR to trunk.

> Change RLMM add/updateRemoteLogSegmentMetadata and 
> putRemotePartitionDeleteMetadata APIS asynchronous.
> --
>
> Key: KAFKA-12988
> URL: https://issues.apache.org/jira/browse/KAFKA-12988
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Satish Duggana
>Assignee: Satish Duggana
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13225) Controller skips sending UpdateMetadataRequest when shutting down broker doesnt host partitions

2021-09-02 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13225.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

merged the PR to trunk

> Controller skips sending UpdateMetadataRequest when shutting down broker 
> doesnt host partitions 
> 
>
> Key: KAFKA-13225
> URL: https://issues.apache.org/jira/browse/KAFKA-13225
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Reporter: David Mao
>Assignee: David Mao
>Priority: Minor
> Fix For: 3.1.0
>
>
> If a broker not hosting replicas for any partitions is shut down while there 
> are offline partitions, the controller can fail to send out metadata updates 
> to other brokers in the cluster.
>  
> Since this is a very niche scenario, I will leave the priority as Minor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13198) TopicsDelta doesn't update deleted topic when processing PartitionChangeRecord

2021-08-17 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13198.
-
Resolution: Fixed

merged the PR to 3.0 and trunk.

> TopicsDelta doesn't update deleted topic when processing PartitionChangeRecord
> --
>
> Key: KAFKA-13198
> URL: https://issues.apache.org/jira/browse/KAFKA-13198
> Project: Kafka
>  Issue Type: Bug
>  Components: kraft, replication
>Reporter: Jose Armando Garcia Sancio
>Assignee: Jose Armando Garcia Sancio
>Priority: Blocker
>  Labels: kip-500
> Fix For: 3.0.0
>
>
> In KRaft, when a replica gets reassigned away from a topic partition, we are 
> not notifying the {{ReplicaManager}} to stop the replica.
> One solution is to track those topic partition ids when processing 
> {{PartitionChangeRecord}} and to return them as {{deleted}} when the replica 
> manager calls {{calculateDeltaChanges}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13206) shutting down broker needs to stop fetching as a follower in KRaft mode

2021-08-16 Thread Jun Rao (Jira)
Jun Rao created KAFKA-13206:
---

 Summary: shutting down broker needs to stop fetching as a follower 
in KRaft mode
 Key: KAFKA-13206
 URL: https://issues.apache.org/jira/browse/KAFKA-13206
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 3.0.0
Reporter: Jun Rao


In ZK mode, the controller sends a StopReplica request (with the deletion flag 
set to false) to the shutting-down broker so that it stops its followers from 
fetching. In KRaft mode, we don't have corresponding logic. This results in 
unnecessarily rejected follower fetch requests during controlled shutdown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13194) LogCleaner may clean past highwatermark

2021-08-13 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13194.
-
Fix Version/s: 3.1.0
 Assignee: Lucas Bradstreet
   Resolution: Fixed

Merged the PR to trunk.

> LogCleaner may clean past highwatermark
> ---
>
> Key: KAFKA-13194
> URL: https://issues.apache.org/jira/browse/KAFKA-13194
> Project: Kafka
>  Issue Type: Bug
>Reporter: Lucas Bradstreet
>Assignee: Lucas Bradstreet
>Priority: Minor
> Fix For: 3.1.0
>
>
> Here we have the cleaning point being bounded to the active segment base 
> offset and the first unstable offset. Which makes sense:
>  
> {code:java}
> // find first segment that cannot be cleaned
> // neither the active segment, nor segments with any messages closer to the head
> // of the log than the minimum compaction lag time may be cleaned
> val firstUncleanableDirtyOffset: Long = Seq(
>   // we do not clean beyond the first unstable offset
>   log.firstUnstableOffset,
>   // the active segment is always uncleanable
>   Option(log.activeSegment.baseOffset),
>   // the first segment whose largest message timestamp is within a minimum time lag from now
>   if (minCompactionLagMs > 0) {
>     // dirty log segments
>     val dirtyNonActiveSegments = log.localNonActiveLogSegmentsFrom(firstDirtyOffset)
>     dirtyNonActiveSegments.find { s =>
>       val isUncleanable = s.largestTimestamp > now - minCompactionLagMs
>       debug(s"Checking if log segment may be cleaned: log='${log.name}' segment.baseOffset=${s.baseOffset} " +
>         s"segment.largestTimestamp=${s.largestTimestamp}; now - compactionLag=${now - minCompactionLagMs}; " +
>         s"is uncleanable=$isUncleanable")
>       isUncleanable
>     }.map(_.baseOffset)
>   } else None
> ).flatten.min
> {code}
>  
> But LSO starts out as None.
> {code:java}
> @volatile private var firstUnstableOffsetMetadata: Option[LogOffsetMetadata] = None
> private[log] def firstUnstableOffset: Option[Long] =
>   firstUnstableOffsetMetadata.map(_.messageOffset)
> {code}
> For most code depending on the LSO, fetchLastStableOffsetMetadata is used to 
> default it to the hwm if it's not set.
>  
> {code:java}
> private def fetchLastStableOffsetMetadata: LogOffsetMetadata = {
>   checkIfMemoryMappedBufferClosed()
>   // cache the current high watermark to avoid a concurrent update invalidating the range check
>   val highWatermarkMetadata = fetchHighWatermarkMetadata
>   firstUnstableOffsetMetadata match {
>     case Some(offsetMetadata) if offsetMetadata.messageOffset < highWatermarkMetadata.messageOffset =>
>       if (offsetMetadata.messageOffsetOnly) {
>         lock synchronized {
>           val fullOffset = convertToOffsetMetadataOrThrow(offsetMetadata.messageOffset)
>           if (firstUnstableOffsetMetadata.contains(offsetMetadata))
>             firstUnstableOffsetMetadata = Some(fullOffset)
>           fullOffset
>         }
>       } else {
>         offsetMetadata
>       }
>     case _ => highWatermarkMetadata
>   }
> }
> {code}
>  
>  
> This means that in the case where the hwm is prior to the active segment 
> base, the log cleaner may clean past the hwm. This is most likely to occur 
> after a broker restart when the log cleaner may start cleaning prior to 
> replication becoming active.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13068) Rename Log to UnifiedLog

2021-08-12 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13068.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Merged the PR to trunk

> Rename Log to UnifiedLog
> 
>
> Key: KAFKA-13068
> URL: https://issues.apache.org/jira/browse/KAFKA-13068
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Major
> Fix For: 3.1.0
>
>
> Once KAFKA-12554 is completed, we can rename Log -> UnifiedLog as described 
> in the doc:  
> [https://docs.google.com/document/d/1dQJL4MCwqQJSPmZkVmVzshFZKuFy_bCPtubav4wBfHQ/edit#].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9555) Topic-based implementation for the RemoteLogMetadataManager

2021-07-19 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-9555.

Resolution: Fixed

merged the PR to trunk.

> Topic-based implementation for the RemoteLogMetadataManager
> ---
>
> Key: KAFKA-9555
> URL: https://issues.apache.org/jira/browse/KAFKA-9555
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Alexandre Dupriez
>Assignee: Satish Duggana
>Priority: Major
> Fix For: 3.1.0
>
>
> The purpose of this task is to implement a {{RemoteLogMetadataManager}} based 
> on an internal topic in Kafka. More details are mentioned in the 
> [KIP|https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-RemoteLogMetadataManagerimplementedwithaninternaltopic].
> Done means:
>  - Pull Request available for review and unit-tests.
> System and integration tests are out of scope of this task and will be part 
> of another task.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13092) Perf regression in LISR requests

2021-07-15 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-13092.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

merged the PR to 3.0 and trunk.

> Perf regression in LISR requests
> 
>
> Key: KAFKA-13092
> URL: https://issues.apache.org/jira/browse/KAFKA-13092
> Project: Kafka
>  Issue Type: Bug
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Critical
> Fix For: 3.0.0
>
>
> With the addition of partition metadata files, we have an extra operation to 
> do when handling LISR requests. This really slows down the processing, so we 
> should flush asynchronously to fix this regression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12554) Split Log layer into Log and LocalLog

2021-07-13 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12554.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

merged the PR to trunk.

> Split Log layer into Log and LocalLog
> -
>
> Key: KAFKA-12554
> URL: https://issues.apache.org/jira/browse/KAFKA-12554
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Major
> Fix For: 3.1.0
>
>
> Split Log layer into Log and LocalLog based on the proposal described in this 
> document: 
> [https://docs.google.com/document/d/1dQJL4MCwqQJSPmZkVmVzshFZKuFy_bCPtubav4wBfHQ/edit#].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10580) Add topic ID support to Fetch request

2021-07-07 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-10580.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

merged the PR to trunk.

> Add topic ID support to Fetch request
> -
>
> Key: KAFKA-10580
> URL: https://issues.apache.org/jira/browse/KAFKA-10580
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
> Fix For: 3.1.0
>
>
> Prevent fetching a stale topic with topic IDs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12964) Corrupt segment recovery can delete new producer state snapshots

2021-07-01 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12964.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

merged the PR to trunk

> Corrupt segment recovery can delete new producer state snapshots
> 
>
> Key: KAFKA-12964
> URL: https://issues.apache.org/jira/browse/KAFKA-12964
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.8.0
>Reporter: Gardner Vickers
>Assignee: Gardner Vickers
>Priority: Major
> Fix For: 3.0.0
>
>
> During log recovery, we may schedule asynchronous deletion in 
> deleteSegmentFiles.
> [https://github.com/apache/kafka/blob/fc5245d8c37a6c9d585c5792940a8f9501bedbe1/core/src/main/scala/kafka/log/Log.scala#L2382]
> If we're truncating the log, this may result in deletions for segments with 
> matching base offsets to segments which will be written in the future. To 
> avoid asynchronously deleting future segments, we rename the segment and 
> index files, but we do not do this for producer state snapshot files. 
> This leaves us vulnerable to a race condition where we could end up deleting 
> snapshot files for segments written after log recovery when async deletion 
> runs.
>  
> To fix this, we should first remove the `SnapshotFile` from the 
> `ProducerStateManager` and rename the file to have a `Log.DeletedFileSuffix`. 
> Then we can asynchronously delete the snapshot file later.
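
The rename-then-delete pattern described above can be sketched generically; this is not the Kafka code, and the ".deleted" suffix and file names here are illustrative:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Generic sketch: rename the snapshot with a ".deleted" suffix first, so a later
// asynchronous delete can never touch a freshly written snapshot that reuses the
// original file name.
public class RenameThenDelete {
    static Path markForDeletion(Path snapshot) throws IOException {
        Path marked = snapshot.resolveSibling(snapshot.getFileName() + ".deleted");
        return Files.move(snapshot, marked, StandardCopyOption.ATOMIC_MOVE);
    }

    static void deleteLater(Path marked) throws IOException {
        Files.deleteIfExists(marked); // safe: new snapshots use the original name
    }
}
{code}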



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12520) Producer state is needlessly rebuilt on startup

2021-06-29 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12520.
-
Fix Version/s: 3.0.0
 Assignee: Cong Ding  (was: Dhruvil Shah)
   Resolution: Fixed

merged the PR to trunk

> Producer state is needlessly rebuilt on startup
> ---
>
> Key: KAFKA-12520
> URL: https://issues.apache.org/jira/browse/KAFKA-12520
> Project: Kafka
>  Issue Type: Bug
>Reporter: Dhruvil Shah
>Assignee: Cong Ding
>Priority: Major
> Fix For: 3.0.0
>
>
> When we find a {{.swap}} file on startup, we typically want to rename and 
> replace it as {{.log}}, {{.index}}, {{.timeindex}}, etc. as a way to complete 
> any ongoing replace operations. These swap files are usually known to have 
> been flushed to disk before the replace operation begins.
> One flaw in the current logic is that when we recover these swap files on 
> startup, we end up truncating the producer state and rebuild it from scratch. 
> This is unneeded as the replace operation does not mutate the producer state 
> by itself. It is only meant to replace the {{.log}} file along with 
> corresponding indices.
> Because of this unneeded producer state rebuild operation, we have seen 
> multi-hour startup times for clusters that have large compacted topics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12816) Add tier storage configs.

2021-06-18 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12816.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

merged the PR to trunk

> Add tier storage configs. 
> --
>
> Key: KAFKA-12816
> URL: https://issues.apache.org/jira/browse/KAFKA-12816
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Satish Duggana
>Assignee: Satish Duggana
>Priority: Major
> Fix For: 3.0.0
>
>
> Add all the tier storage related configurations including remote log manager, 
> remote storage manager, and remote log metadata manager. 
> These configs are described in the KIP-405 
> [here|https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Configs.1].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12955) Fix LogLoader to pass materialized view of segments for deletion

2021-06-16 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12955.
-
Fix Version/s: 3.0.0
 Assignee: Kowshik Prakasam
   Resolution: Fixed

Merged the PR to trunk.

> Fix LogLoader to pass materialized view of segments for deletion
> 
>
> Key: KAFKA-12955
> URL: https://issues.apache.org/jira/browse/KAFKA-12955
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Critical
> Fix For: 3.0.0
>
>
> Within {{LogLoader.removeAndDeleteSegmentsAsync()}}, we should force 
> materialization of the {{segmentsToDelete}} iterable, to make sure the 
> results of the iteration remain valid and deterministic. We should also pass 
> only the materialized view to the logic that deletes the segments. Otherwise, 
> we could end up deleting the wrong segments asynchronously.
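
A small sketch of the materialization fix, with placeholder names rather than the real LogLoader types: the lazy iterable is forced once, and the same concrete list is used both for removing segments from the in-memory map and for the asynchronous deletion.

{code:scala}
object MaterializedDeleteSketch {
  // segmentsToDelete may be a lazy view over a mutable segments map; forcing
  // it to a List pins the exact set of segments before any mutation happens.
  def removeAndDeleteSegments[S](segmentsToDelete: Iterable[S],
                                 removeFromMap: S => Unit,
                                 deleteAsync: Seq[S] => Unit): Unit = {
    val toDelete = segmentsToDelete.toList // force materialization up front
    toDelete.foreach(removeFromMap)        // mutate the map...
    deleteAsync(toDelete)                  // ...and delete exactly the same snapshot
  }
}
{code}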



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12852) Log.splitOverflowedSegment() doesn't generate producer snapshot for new segment

2021-05-26 Thread Jun Rao (Jira)
Jun Rao created KAFKA-12852:
---

 Summary: Log.splitOverflowedSegment() doesn't generate producer 
snapshot for new segment
 Key: KAFKA-12852
 URL: https://issues.apache.org/jira/browse/KAFKA-12852
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.0.0
Reporter: Jun Rao


Currently, we expect every segment (except the very first one) to have a 
corresponding producer snapshot file. However, in 
Log.splitOverflowedSegment(), we don't seem to create a producer snapshot for the 
split segments with a new base offset. This may cause issues with tier storage 
since we expect each archived segment to have a producer snapshot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12758) Create a new `server-common` module and move ApiMessageAndVersion, RecordSerde, AbstractApiMessageSerde, and BytesApiMessageSerde to that module.

2021-05-11 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12758.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

merged the PR to trunk

> Create a new `server-common` module and move ApiMessageAndVersion, 
> RecordSerde, AbstractApiMessageSerde, and BytesApiMessageSerde to that module.
> -
>
> Key: KAFKA-12758
> URL: https://issues.apache.org/jira/browse/KAFKA-12758
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Satish Duggana
>Assignee: Satish Duggana
>Priority: Major
> Fix For: 3.0.0
>
>
> Create a new `server-common` module and move ApiMessageAndVersion, 
> RecordSerde, AbstractApiMessageSerde, and BytesApiMessageSerde to that module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12429) Serdes for all message types in internal topic which is used in default implementation for RLMM.

2021-05-05 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12429.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

merged the PR to trunk

> Serdes for all message types in internal topic which is used in default 
> implementation for RLMM.
> 
>
> Key: KAFKA-12429
> URL: https://issues.apache.org/jira/browse/KAFKA-12429
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Satish Duggana
>Assignee: Satish Duggana
>Priority: Major
> Fix For: 3.0.0
>
>
> RLMM default implementation is based on storing all the metadata in an 
> internal topic.
> We need serdes and the format of the message types that need to be stored in the 
> topic.
> You can see more details in the 
> [KIP|https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-MessageFormat]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12553) Refactor Log layer recovery logic

2021-04-20 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12553.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

merged the PR to trunk

> Refactor Log layer recovery logic
> -
>
> Key: KAFKA-12553
> URL: https://issues.apache.org/jira/browse/KAFKA-12553
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Major
> Fix For: 3.0.0
>
>
> Refactor Log layer recovery logic by extracting it out of the kafka.log.Log 
> class into separate modules.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12368) Inmemory implementation of RSM and RLMM.

2021-04-13 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12368.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

merged the PR to trunk

> Inmemory implementation of RSM and RLMM. 
> -
>
> Key: KAFKA-12368
> URL: https://issues.apache.org/jira/browse/KAFKA-12368
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Satish Duggana
>Assignee: Satish Duggana
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-3968) fsync() is not called on parent directory when new FileMessageSet is flushed to disk

2021-04-02 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-3968.

Fix Version/s: 3.0.0
 Assignee: Cong Ding
   Resolution: Fixed

Merged to trunk. Thanks, Cong, for fixing this long-standing issue.

> fsync() is not called on parent directory when new FileMessageSet is flushed 
> to disk
> 
>
> Key: KAFKA-3968
> URL: https://issues.apache.org/jira/browse/KAFKA-3968
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.9.0.1, 0.10.0.0
> Environment: Linux, ext4 filesystem
>Reporter: Andrey Neporada
>Assignee: Cong Ding
>Priority: Major
>  Labels: reliability
> Fix For: 3.0.0
>
>
> Kafka does not call fsync() on directory when new log segment is created and 
> flushed to disk.
> The problem is that following sequence of calls doesn't guarantee file 
> durability: 
> fd = open("log", O_RDWR | O_CREAT, 0644); // suppose open creates "log"
> write(fd, buf, len);
> fsync(fd);
> If the system crashes after fsync() but before the parent directory has been 
> flushed to disk, the log file can disappear.
> This is true at least for ext4 on Linux.
> The proposed solution is to flush the directory when flush() is called for the 
> first time.
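
A sketch of how a directory can be flushed from the JVM, assuming a Linux filesystem where opening the directory read-only and forcing the channel issues an fsync on it; this mirrors the proposed fix but is not the actual Kafka utility.

{code:scala}
import java.nio.channels.FileChannel
import java.nio.file.{Path, Paths, StandardOpenOption}

object DirFlushSketch {
  // Flush a directory so that a newly created file in it survives a crash.
  // Works on Linux; other platforms may throw, so callers might ignore errors.
  def flushDir(dir: Path): Unit = {
    val channel = FileChannel.open(dir, StandardOpenOption.READ)
    try channel.force(true)
    finally channel.close()
  }
}

// Usage (hypothetical path): after rolling a new segment file,
// DirFlushSketch.flushDir(Paths.get("/var/kafka-logs/my-topic-0"))
{code}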



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12575) Eliminate Log.isLogDirOffline boolean attribute

2021-03-31 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12575.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

merged the PR to trunk

> Eliminate Log.isLogDirOffline boolean attribute
> ---
>
> Key: KAFKA-12575
> URL: https://issues.apache.org/jira/browse/KAFKA-12575
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Major
> Fix For: 3.0.0
>
>
> This attribute was added in [https://github.com/apache/kafka/pull/9676] but 
> it is redundant and can be eliminated in favor of looking up 
> LogDirFailureChannel. The performance implication of the hash map lookup inside 
> LogDirFailureChannel should be low/none.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12571) Eliminate LeaderEpochFileCache constructor dependency on LogEndOffset

2021-03-30 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12571.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

merged the PR to trunk

> Eliminate LeaderEpochFileCache constructor dependency on LogEndOffset
> -
>
> Key: KAFKA-12571
> URL: https://issues.apache.org/jira/browse/KAFKA-12571
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Major
> Fix For: 3.0.0
>
>
> *This is a precursor to KAFKA-12553.*
> Before refactoring the recovery logic (KAFKA-12553), we would like to move 
> the logic to [initialize 
> LeaderEpochFileCache|https://github.com/apache/kafka/blob/d9bb2ef596343da9402bff4903b129cff1f7c22b/core/src/main/scala/kafka/log/Log.scala#L579]
>  outside the Log class. Once we suitably initialize LeaderEpochFileCache 
> outside Log, we will be able to pass it as a dependency into both the Log class 
> constructor and the recovery module. However, the LeaderEpochFileCache 
> constructor takes a 
> [dependency|https://github.com/apache/kafka/blob/d9bb2ef596343da9402bff4903b129cff1f7c22b/core/src/main/scala/kafka/server/epoch/LeaderEpochFileCache.scala#L42]
>  on logEndOffset (via a callback). This dependency prevents the instantiation 
> of LeaderEpochFileCache outside the Log class.
> This situation blocks the recovery logic (KAFKA-12553) refactor. So this 
> constructor dependency on logEndOffset needs to be eliminated.
> It turns out the logEndOffset dependency is used only in 1 of the 
> LeaderEpochFileCache methods: LeaderEpochFileCache.endOffsetFor, and just for 
> 1 particular [case|#L201]. Therefore, it should be possible to modify this so 
> that we only pass the logEndOffset as a parameter into endOffsetFor 
> whenever the method is called.
>  
>  
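
A simplified sketch of the refactor, with made-up types rather than the real LeaderEpochFileCache API: the log end offset is supplied as a parameter to endOffsetFor instead of being captured as a constructor callback.

{code:scala}
object LeaderEpochCacheSketch {
  // epochs: (epoch, startOffset) pairs, assumed sorted by epoch ascending.
  class LeaderEpochCache(epochs: Seq[(Int, Long)]) {
    // End offset (exclusive) for the requested epoch: the start offset of the
    // next higher epoch, or the caller-supplied log end offset for the latest
    // epoch. No () => Long callback is needed at construction time.
    def endOffsetFor(requestedEpoch: Int, logEndOffset: Long): Long =
      epochs.collectFirst { case (epoch, start) if epoch > requestedEpoch => start }
        .getOrElse(logEndOffset)
  }
}
{code}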



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12552) Extract segments map out of Log class into separate class

2021-03-30 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12552.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Merged the PR to trunk.

> Extract segments map out of Log class into separate class
> -
>
> Key: KAFKA-12552
> URL: https://issues.apache.org/jira/browse/KAFKA-12552
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Major
> Fix For: 3.0.0
>
>
> *This is a precursor to KAFKA-12553.*
> Extract segments map out of Log class into separate class. This will improve 
> the testability and maintainability of the Log layer, and also will be useful 
> to subsequently refactor the recovery logic (see KAFKA-12553).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9548) SPI - RemoteStorageManager and RemoteLogMetadataManager interfaces and related classes.

2021-03-03 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-9548.

Fix Version/s: 3.0.0
   Resolution: Fixed

merged the PR to trunk

> SPI - RemoteStorageManager and RemoteLogMetadataManager interfaces and 
> related classes.
> ---
>
> Key: KAFKA-9548
> URL: https://issues.apache.org/jira/browse/KAFKA-9548
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Reporter: Satish Duggana
>Assignee: Satish Duggana
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12177) Retention is not idempotent

2021-03-02 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-12177.
-
Fix Version/s: 3.0.0
 Assignee: Lucas Bradstreet
   Resolution: Fixed

merged the PR to trunk

> Retention is not idempotent
> ---
>
> Key: KAFKA-12177
> URL: https://issues.apache.org/jira/browse/KAFKA-12177
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Lucas Bradstreet
>Assignee: Lucas Bradstreet
>Priority: Minor
> Fix For: 3.0.0
>
>
> Kafka today applies retention in the following order:
>  # Time
>  # Size
>  # Log start offset
> Today it is possible for a segment with offsets less than the log start 
> offset to contain data that is not deletable due to time retention. This 
> means that it's possible for log start offset retention to unblock further 
> deletions as a result of time based retention. Note that this does require a 
> case where the max timestamp for each segment increases, decreases and then 
> increases again. Even so, it would be nice to make retention idempotent by 
> applying log start offset retention first, followed by size and time. This 
> would also be potentially cheaper to perform, as neither log start offset nor 
> size retention requires the maxTimestamp for a segment to be loaded from disk 
> after a broker restart.
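
A sketch of the proposed ordering under simplified assumptions (retention deletes a prefix of segments, and the per-segment breach checks are placeholders): start-offset retention runs first, then size, then time.

{code:scala}
object RetentionOrderSketch {
  final case class Segment(baseOffset: Long, sizeBytes: Long, maxTimestampMs: Long)

  // Each retention rule deletes a prefix of the remaining segments; applying
  // the start-offset rule first makes a second identical pass a no-op.
  def applyRetention(segments: List[Segment],
                     breachedByStartOffset: Segment => Boolean,
                     breachedBySize: Segment => Boolean,
                     breachedByTime: Segment => Boolean): List[Segment] = {
    val afterStartOffset = segments.dropWhile(breachedByStartOffset) // no timestamp load needed
    val afterSize        = afterStartOffset.dropWhile(breachedBySize)
    afterSize.dropWhile(breachedByTime)                              // time-based check last
  }
}
{code}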



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9672) Dead brokers in ISR cause isr-expiration to fail with exception

2021-02-19 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-9672.

Fix Version/s: 3.0.0
   Resolution: Fixed

merged the PR to trunk

> Dead brokers in ISR cause isr-expiration to fail with exception
> ---
>
> Key: KAFKA-9672
> URL: https://issues.apache.org/jira/browse/KAFKA-9672
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.4.0, 2.4.1
>Reporter: Ivan Yurchenko
>Assignee: Jose Armando Garcia Sancio
>Priority: Major
> Fix For: 3.0.0
>
>
> We're running Kafka 2.4 and facing a pretty strange situation.
>  Let's say there were three brokers in the cluster 0, 1, and 2. Then:
>  1. Broker 3 was added.
>  2. Partitions were reassigned from broker 0 to broker 3.
>  3. Broker 0 was shut down (not gracefully) and removed from the cluster.
>  4. We see the following state in ZooKeeper:
> {code:java}
> ls /brokers/ids
> [1, 2, 3]
> get /brokers/topics/foo
> {"version":2,"partitions":{"0":[2,1,3]},"adding_replicas":{},"removing_replicas":{}}
> get /brokers/topics/foo/partitions/0/state
> {"controller_epoch":123,"leader":1,"version":1,"leader_epoch":42,"isr":[0,2,3,1]}
> {code}
> It means the dead broker 0 remains in the partition's ISR. A big share of 
> the partitions in the cluster have this issue.
> This is actually causing errors:
> {code:java}
> Uncaught exception in scheduled task 'isr-expiration' 
> (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.ReplicaNotAvailableException: Replica with id 
> 12 is not available on broker 17
> {code}
> It means that effectively {{isr-expiration}} task is not working any more.
> I have a suspicion that this was introduced by [this commit (line 
> selected)|https://github.com/apache/kafka/commit/57baa4079d9fc14103411f790b9a025c9f2146a4#diff-5450baca03f57b9f2030f93a480e6969R856]
> Unfortunately, I haven't been able to reproduce this in isolation.
> Any hints about how to reproduce (so I can write a patch) or mitigate the 
> issue on a running cluster are welcome.
> Generally, I assume that not throwing {{ReplicaNotAvailableException}} on a 
> dead (i.e. non-existent) broker, but instead considering it out-of-sync and removing 
> it from the ISR, should fix the problem.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12153) follower can hit OffsetOutOfRangeException during truncation

2021-01-06 Thread Jun Rao (Jira)
Jun Rao created KAFKA-12153:
---

 Summary: follower can hit OffsetOutOfRangeException during 
truncation
 Key: KAFKA-12153
 URL: https://issues.apache.org/jira/browse/KAFKA-12153
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.7.0
Reporter: Jun Rao
Assignee: Jason Gustafson


Currently, we have the following code path.

log.truncateTo() => updateLogEndOffset() => updateHighWatermarkMetadata() => 
maybeIncrementFirstUnstableOffset() => convertToOffsetMetadataOrThrow() => 
read()

This path seems problematic. The issue is that updateLogEndOffset() is called 
before loadProducerState() in log.truncateTo(). At that point, the 
producerState is not reflecting the truncated state yet and 
producerStateManager.firstUnstableOffset(called in 
maybeIncrementFirstUnstableOffset() to feed read()) could return an offset 
larger than the truncated logEndOffset, which will lead to 
OffsetOutOfRangeException.

 

This issue is relatively rare since it requires truncation below the high 
watermark.
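
A schematic sketch of the ordering fix, not the real Log.truncateTo: producer state is reloaded for the truncated log before the log end offset (and hence the first unstable offset) is updated.

{code:scala}
object TruncateOrderSketch {
  // The three steps are passed in as functions purely to keep the sketch
  // self-contained; in the broker they are methods on the Log.
  def truncateTo(targetOffset: Long,
                 truncateSegments: Long => Unit,
                 loadProducerState: Long => Unit,
                 updateLogEndOffset: Long => Unit): Unit = {
    truncateSegments(targetOffset)
    loadProducerState(targetOffset)   // producer state now reflects the truncation...
    updateLogEndOffset(targetOffset)  // ...so firstUnstableOffset cannot exceed the new LEO
  }
}
{code}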



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10832) Recovery logic is using incorrect ProducerStateManager instance when updating producers

2020-12-11 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-10832.
-
Fix Version/s: 2.8.0
   Resolution: Fixed

Merged the PR to trunk.

> Recovery logic is using incorrect ProducerStateManager instance when updating 
> producers 
> 
>
> Key: KAFKA-10832
> URL: https://issues.apache.org/jira/browse/KAFKA-10832
> Project: Kafka
>  Issue Type: Bug
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Major
> Fix For: 2.8.0
>
>
> The bug is that from within {{Log.updateProducers(…)}}, the code operates on 
> the {{producerStateManager}} attribute of the {{Log}} instance instead of 
> operating on an input parameter. Please see 
> [this|https://github.com/apache/kafka/blob/1d84f543678c4c08800bc3ea18c04a9db8adf7e4/core/src/main/scala/kafka/log/Log.scala#L1464]
>  LOC where it calls {{producerStateManager.prepareUpdate}} thus accessing the 
> attribute from the {{Log}} object (see 
> [this|https://github.com/apache/kafka/blob/1d84f543678c4c08800bc3ea18c04a9db8adf7e4/core/src/main/scala/kafka/log/Log.scala#L251]).
>  This looks unusual particularly for {{Log.loadProducersFromLog(...)}} 
> [path|https://github.com/apache/kafka/blob/1d84f543678c4c08800bc3ea18c04a9db8adf7e4/core/src/main/scala/kafka/log/Log.scala#L956].
>  Here I believe we should be using the instance passed to the method, rather 
> than the attribute from the {{Log}} instance.
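
A toy sketch of the intended fix, with placeholder types: the update operates on the ProducerStateManager argument rather than on the Log's own attribute, so recovery can rebuild state into whichever instance it was given.

{code:scala}
object ProducerUpdateSketch {
  final case class RecordBatch(producerId: Long, baseSequence: Int)

  class ProducerStateManager {
    def update(batch: RecordBatch): Unit = {
      // track producer id / sequence here (omitted in this sketch)
    }
  }

  // Uses the argument, never a captured `this.producerStateManager`.
  def updateProducers(batch: RecordBatch, producerStateManager: ProducerStateManager): Unit =
    producerStateManager.update(batch)
}
{code}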



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10723) LogManager leaks internal thread pool activity during shutdown

2020-11-19 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-10723.
-
Fix Version/s: 2.8.0
   Resolution: Fixed

merged to trunk

> LogManager leaks internal thread pool activity during shutdown
> --
>
> Key: KAFKA-10723
> URL: https://issues.apache.org/jira/browse/KAFKA-10723
> Project: Kafka
>  Issue Type: Bug
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Major
> Fix For: 2.8.0
>
>
> *TL;DR:*
> The asynchronous shutdown in {{LogManager}} has the shortcoming that if 
> during shutdown any of the internal futures fail, then we do not always 
> ensure that all futures are completed before {{LogManager.shutdown}} returns. 
> As a result, despite the shut down completed message from KafkaServer is seen 
> in the error logs, some futures continue to run from inside LogManager 
> attempting to close the logs. This is misleading and it could possibly break 
> the general rule of avoiding post-shutdown activity in the Broker.
> *Description:*
> When LogManager is shutting down, exceptions in log closure are handled 
> [here|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L497-L501].
>  However, this 
> [line|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L502]
>  in the finally clause shuts down the thread pools *asynchronously*. The 
> code {{threadPools.foreach(_.shutdown())}} initiates an orderly shutdown (for 
> each thread pool) in which previously submitted tasks are executed, but no 
> new tasks will be accepted (see javadoc link 
> [here|https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html#shutdown()])_._
>  As a result, if there is an exception during log closure, some of the thread 
> pools which are closing logs could be leaked and continue to run in the 
> background, after the control returns to the caller (i.e. {{KafkaServer}}). 
> As a result, even after the "shut down completed" message is seen in the 
> error logs (originating from {{KafkaServer}} shutdown sequence), log closures 
> continue to happen in the background, which is misleading.
>   
> *Proposed options for fixes:*
> It seems useful that we maintain the contract with {{KafkaServer}} that after 
> {{LogManager.shutdown}} is called once, all tasks that close the logs are 
> guaranteed to have completed before the call returns. There are probably a 
> couple of different ways to fix this:
>  # Replace {{threadPools.foreach(_.shutdown())}} with 
> {{threadPools.foreach(_.awaitTermination())}}. This ensures that we wait 
> for all threads to be shut down before returning from the {{LogManager.shutdown}} 
> call (a sketch of this option follows the list).
>  # Skip creating the checkpoint and clean shutdown files only for the affected 
> directory if any of its futures throws an error. We continue to wait for all 
> futures to complete for all directories. This can require some changes to 
> [this for 
> loop|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L481-L496],
>  so that we wait for all futures to complete regardless of whether one of 
> them threw an error.
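
A minimal sketch of option 1 above, keeping the orderly shutdown() and then blocking until the pools drain; the 30-second timeout and the shutdownNow() fallback are assumptions, not part of the proposal.

{code:scala}
import java.util.concurrent.{ExecutorService, TimeUnit}

object LogManagerShutdownSketch {
  def shutdownPools(threadPools: Seq[ExecutorService]): Unit = {
    threadPools.foreach(_.shutdown())                   // stop accepting new tasks
    threadPools.foreach { pool =>
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) // wait for in-flight log closures
        pool.shutdownNow()                              // fallback; illustrative only
    }
  }
}
{code}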



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10624) [Easy] FeatureZNodeStatus should use sealed trait instead of Enumeration

2020-11-05 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-10624.
-
Fix Version/s: 2.8.0
   Resolution: Fixed

merged the PR to trunk.

> [Easy] FeatureZNodeStatus should use sealed trait instead of Enumeration
> 
>
> Key: KAFKA-10624
> URL: https://issues.apache.org/jira/browse/KAFKA-10624
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Minor
> Fix For: 2.8.0
>
>
> In Scala, we prefer sealed traits over Enumeration since the former gives you 
> exhaustiveness checking. With Scala Enumeration, you don't get a warning if 
> you add a new value that is not handled in a given pattern match.
> This Jira tracks refactoring 
> [FeatureZNodeStatus|https://github.com/apache/kafka/blob/fb4f297207ef62f71e4a6d2d0dac75752933043d/core/src/main/scala/kafka/zk/ZkData.scala#L801]
>  from an Enumeration to a sealed trait. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10599) Implement basic CLI tool for feature versioning system

2020-10-20 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-10599.
-
Fix Version/s: 2.7.0
   Resolution: Fixed

Merged the PR to 2.7 and trunk.

> Implement basic CLI tool for feature versioning system
> --
>
> Key: KAFKA-10599
> URL: https://issues.apache.org/jira/browse/KAFKA-10599
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Major
> Fix For: 2.7.0
>
>
> Implement a basic CLI tool for the feature versioning system providing the 
> basic facilities as explained in this section of KIP-584: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP584:Versioningschemeforfeatures-BasicCLItoolusage



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9393) DeleteRecords may cause extreme lock contention for large partition directories

2020-10-09 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-9393.

Fix Version/s: 2.7.0
 Assignee: Gardner Vickers
   Resolution: Fixed

merged to trunk.

> DeleteRecords may cause extreme lock contention for large partition 
> directories
> ---
>
> Key: KAFKA-9393
> URL: https://issues.apache.org/jira/browse/KAFKA-9393
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Lucas Bradstreet
>Assignee: Gardner Vickers
>Priority: Major
> Fix For: 2.7.0
>
>
> DeleteRecords, frequently used by KStreams triggers a 
> Log.maybeIncrementLogStartOffset call, calling 
> kafka.log.ProducerStateManager.listSnapshotFiles which calls 
> java.io.File.listFiles on the partition dir. The time taken to list this 
> directory can be extreme for partitions with many small segments (e.g 2) 
> taking multiple seconds to finish. This causes lock contention for the log, 
> and if produce requests are also occurring for the same log can cause a 
> majority of request handler threads to become blocked waiting for the 
> DeleteRecords call to finish.
> I believe this is a problem going back to the initial implementation of the 
> transactional producer, but I need to confirm how far back it goes.
> One possible solution is to maintain a producer state snapshot aligned to the 
> log segment, and simply delete it whenever we delete a segment. This would 
> ensure that we never have to perform a directory scan.
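
A sketch of the direction suggested in the last paragraph, with illustrative file naming: if the snapshot file is keyed by its segment's base offset, deleting a segment deletes one known path and never needs a File.listFiles scan of the partition directory.

{code:scala}
import java.nio.file.{Files, Path}

object SegmentAlignedSnapshotSketch {
  // File names follow Kafka's zero-padded offset convention for illustration.
  def snapshotFileFor(partitionDir: Path, baseOffset: Long): Path =
    partitionDir.resolve(f"$baseOffset%020d.snapshot")

  def deleteSegmentAndSnapshot(partitionDir: Path, baseOffset: Long): Unit = {
    Files.deleteIfExists(partitionDir.resolve(f"$baseOffset%020d.log"))
    Files.deleteIfExists(snapshotFileFor(partitionDir, baseOffset)) // no directory listing
  }
}
{code}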



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10584) IndexSearchType should use sealed trait instead of Enumeration

2020-10-07 Thread Jun Rao (Jira)
Jun Rao created KAFKA-10584:
---

 Summary: IndexSearchType should use sealed trait instead of 
Enumeration
 Key: KAFKA-10584
 URL: https://issues.apache.org/jira/browse/KAFKA-10584
 Project: Kafka
  Issue Type: Improvement
  Components: core
Reporter: Jun Rao


In Scala, we prefer sealed traits over Enumeration since the former gives you 
exhaustiveness checking. With Scala Enumeration, you don't get a warning if you 
add a new value that is not handled in a given pattern match.
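
A small sketch of the contrast, with illustrative case names rather than the actual IndexSearchType values: a sealed trait gives the compiler a closed set of cases, so a non-exhaustive match is flagged at compile time, which Enumeration does not do.

{code:scala}
object SealedTraitSketch {
  sealed trait IndexSearchType
  case object KeySearch extends IndexSearchType
  case object ValueSearch extends IndexSearchType

  def describe(t: IndexSearchType): String = t match {
    case KeySearch   => "search by key"
    case ValueSearch => "search by value"
    // Adding a new case object without handling it here produces a
    // "match may not be exhaustive" warning at compile time.
  }
}
{code}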



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10028) Implement write path for feature versioning scheme

2020-10-07 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-10028.
-
Fix Version/s: 2.7.0
   Resolution: Fixed

merged the PR to trunk

> Implement write path for feature versioning scheme
> --
>
> Key: KAFKA-10028
> URL: https://issues.apache.org/jira/browse/KAFKA-10028
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Major
> Fix For: 2.7.0
>
>
> Goal is to implement various classes and integration for the write path of 
> the feature versioning system 
> ([KIP-584|https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features]).
>  This is preceded by the read path implementation (KAFKA-10027). The write 
> path implementation involves developing the new controller API: 
> UpdateFeatures that enables transactional application of a set of 
> cluster-wide feature updates to the ZK {{'/features'}} node, along with 
> required ACL permissions.
>  
> Details about the write path are explained [in this 
> part|https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-ChangestoKafkaController]
>  of the KIP.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10300) fix flaky core/group_mode_transactions_test.py

2020-07-23 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-10300.
-
Fix Version/s: 2.7.0
   Resolution: Fixed

merged to trunk

> fix flaky core/group_mode_transactions_test.py
> --
>
> Key: KAFKA-10300
> URL: https://issues.apache.org/jira/browse/KAFKA-10300
> Project: Kafka
>  Issue Type: Bug
>  Components: core, system tests
>Reporter: Chia-Ping Tsai
>Assignee: Chia-Ping Tsai
>Priority: Major
> Fix For: 2.7.0
>
>
> {quote}
> test_id:    
> kafkatest.tests.core.group_mode_transactions_test.GroupModeTransactionsTest.test_transactions.failure_mode=hard_bounce.bounce_target=brokers
> status:     FAIL
> run time:   9 minutes 47.698 seconds
>  
>  
>     copier-0 - Failed to copy all messages in 240s.
> Traceback (most recent call last):
>   File 
> "/usr/local/lib/python2.7/dist-packages/ducktape/tests/runner_client.py", 
> line 134, in run
>     data = self.run_test()
>   File 
> "/usr/local/lib/python2.7/dist-packages/ducktape/tests/runner_client.py", 
> line 192, in run_test
>     return self.test_context.function(self.test)
>   File "/usr/local/lib/python2.7/dist-packages/ducktape/mark/_mark.py", line 
> 429, in wrapper
>     return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/opt/kafka-dev/tests/kafkatest/tests/core/group_mode_transactions_test.py", 
> line 271, in test_transactions
>     num_messages_to_copy=self.num_seed_messages)
>   File 
> "/opt/kafka-dev/tests/kafkatest/tests/core/group_mode_transactions_test.py", 
> line 230, in copy_messages_transactionally
>     (copier.transactional_id, copier_timeout_sec))
>   File "/usr/local/lib/python2.7/dist-packages/ducktape/utils/util.py", line 
> 41, in wait_until
>     raise TimeoutError(err_msg() if callable(err_msg) else err_msg)
> TimeoutError: copier-0 - Failed to copy all messages in 240s.
> {quote}
>  
> this issue is same to KAFKA-10274 so we can apply the same approach



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10274) Transaction system test uses inconsistent timeouts

2020-07-22 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-10274.
-
Fix Version/s: 2.6.0
   Resolution: Fixed

Merged this to trunk and 2.6 branch.

> Transaction system test uses inconsistent timeouts
> --
>
> Key: KAFKA-10274
> URL: https://issues.apache.org/jira/browse/KAFKA-10274
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 2.6.0
>
>
> We've seen some failures in the transaction system test with errors like the 
> following:
> {code}
> copier-1 : Message copier didn't make enough progress in 30s. Current 
> progress: 0
> {code}
> Looking at the consumer logs, we see the following messages repeating over 
> and over:
> {code}
> [2020-07-14 06:50:21,466] DEBUG [Consumer 
> clientId=consumer-transactions-test-consumer-group-1, 
> groupId=transactions-test-consumer-group] Fetching committed offsets for 
> partitions: [input-topic-1] 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2020-07-14 06:50:21,468] DEBUG [Consumer 
> clientId=consumer-transactions-test-consumer-group-1, 
> groupId=transactions-test-consumer-group] Failed to fetch offset for 
> partition input-topic-1: There are unstable offsets that need to be cleared. 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> {code}
> I think the problem is that the test implicitly depends on the transaction 
> timeout which has been configured to 40s even though it expects progress 
> after 30s.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10257) system test kafkatest.tests.core.security_rolling_upgrade_test fails

2020-07-15 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-10257.
-
Fix Version/s: 2.6.0
   Resolution: Fixed

Merged to 2.6 and trunk.

> system test kafkatest.tests.core.security_rolling_upgrade_test fails
> 
>
> Key: KAFKA-10257
> URL: https://issues.apache.org/jira/browse/KAFKA-10257
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Jun Rao
>Assignee: Chia-Ping Tsai
>Priority: Major
> Fix For: 2.6.0
>
>
> The test failure was reported in 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-07-08--001.1594266883--chia7712--KAFKA-10235--a76224fff/report.html
> Saw the following error in the log.
> {code:java}
> [2020-07-09 00:56:37,575] ERROR [KafkaServer id=1] Fatal error during 
> KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
> java.lang.IllegalArgumentException: Could not find a 'KafkaServer' or 
> 'internal.KafkaServer' entry in the JAAS configuration. System property 
> 'java.security.auth.login.config' is not set
> at 
> org.apache.kafka.common.security.JaasContext.defaultContext(JaasContext.java:133)
> at 
> org.apache.kafka.common.security.JaasContext.load(JaasContext.java:98)
> at 
> org.apache.kafka.common.security.JaasContext.loadServerContext(JaasContext.java:70)
> at 
> org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:131)
> at 
> org.apache.kafka.common.network.ChannelBuilders.serverChannelBuilder(ChannelBuilders.java:97)
> at kafka.network.Processor.(SocketServer.scala:780)
> at kafka.network.SocketServer.newProcessor(SocketServer.scala:406)
> at 
> kafka.network.SocketServer.$anonfun$addDataPlaneProcessors$1(SocketServer.scala:285)
> at 
> kafka.network.SocketServer.addDataPlaneProcessors(SocketServer.scala:284)
> at 
> kafka.network.SocketServer.$anonfun$createDataPlaneAcceptorsAndProcessors$1(SocketServer.scala:251)
> at 
> kafka.network.SocketServer.$anonfun$createDataPlaneAcceptorsAndProcessors$1$adapted(SocketServer.scala:248)
> at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
> at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:920)
> at 
> kafka.network.SocketServer.createDataPlaneAcceptorsAndProcessors(SocketServer.scala:248)
> at kafka.network.SocketServer.startup(SocketServer.scala:122)
> at kafka.server.KafkaServer.startup(KafkaServer.scala:297)
> at 
> kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:44)
> at kafka.Kafka$.main(Kafka.scala:82)
> at kafka.Kafka.main(Kafka.scala)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10002) Improve performances of StopReplicaRequest with large number of partitions to be deleted

2020-07-13 Thread Jun Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-10002.
-
Fix Version/s: 2.7.0
   Resolution: Fixed

merged the PR to trunk

> Improve performances of StopReplicaRequest with large number of partitions to 
> be deleted
> 
>
> Key: KAFKA-10002
> URL: https://issues.apache.org/jira/browse/KAFKA-10002
> Project: Kafka
>  Issue Type: Improvement
>Reporter: David Jacot
>Assignee: David Jacot
>Priority: Major
> Fix For: 2.7.0
>
>
> I have noticed that StopReplicaRequests with partitions to be deleted are 
> extremely slow when there are more than 2000 partitions, which leads to hitting 
> the request timeout in the controller. A request with 2000 partitions to be 
> deleted still works, but performance degrades significantly as the number 
> increases. For example, a request with 3000 partitions to be deleted takes 
> approx. 60 seconds to be processed.
> A CPU profile shows that most of the time is spent in checkpointing log start 
> offsets and recovery offsets. Almost 90% of the time is there. See attached. 
> When a partition is deleted, the replica manager calls 
> `ReplicaManager#asyncDelete`, which checkpoints recovery offsets and log start 
> offsets. As the checkpoints are per data directory, the checkpointing is done 
> for all the partitions in the directory of the partition to be deleted. In 
> our case, where we have only one data directory, deleting 1000 
> partitions means checkpointing the same things 1000 times, which is not 
> efficient.
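
A sketch of the batching idea, with placeholder types: group the partitions being deleted by data directory and checkpoint each directory once instead of once per deleted partition.

{code:scala}
object CheckpointBatchingSketch {
  final case class TopicPartition(topic: String, partition: Int)

  def deletePartitions(toDelete: Seq[TopicPartition],
                       dirOf: TopicPartition => String,
                       deleteLog: TopicPartition => Unit,
                       checkpointDir: String => Unit): Unit = {
    toDelete.foreach(deleteLog)
    toDelete.map(dirOf).distinct.foreach(checkpointDir) // one checkpoint per data directory
  }
}
{code}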



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

