[jira] [Created] (KAFKA-15808) kstreams app consumer threads all fenced due to outdated static membership ids during group coordinator migration

2023-11-10 Thread Carrie Hazelton (Jira)
Carrie Hazelton created KAFKA-15808:
---

 Summary: kstreams app consumer threads all fenced due to outdated 
static membership ids during group coordinator migration
 Key: KAFKA-15808
 URL: https://issues.apache.org/jira/browse/KAFKA-15808
 Project: Kafka
  Issue Type: Bug
  Components: group-coordinator, streams
Affects Versions: 3.5.0
Reporter: Carrie Hazelton
 Attachments: broker-50003-kafka.log, broker-55002-kafka.log, 
kstreams-55002-kafka.log

I am wondering if my team encountered a bug with static membership in a KStream 
application (river.stream-filter in the attached logs). There was an abrupt 
shutdown of all of its consumer threads after they all became fenced on 
membership ID conflicts. It occurred during a rolling deployment to the Kafka 
broker fleet, which triggered a migration of the group coordinator to a new 
broker (from kafka-broker-55002 to kafka-broker-50003).

My understanding is that a new group coordinator shouldn't start interacting 
with consumer threads until it has finished loading membership metadata during 
the migration. In this case, the new group coordinator started interacting with 
the consumer threads before it finished loading the latest membership 
information from the most recent generation of the membership metadata. Because 
the migration was incomplete, the new group coordinator held outdated consumer 
membership IDs, which caused all the streaming app's consumer threads to get 
fenced due to conflicting membership IDs.
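The fencing mechanism described above can be sketched as a toy model. This is not Kafka's actual coordinator code; all class, method, and error names here are illustrative. The idea is that a static member's group.instance.id maps to exactly one member ID on the coordinator, and a request carrying a different member ID for the same instance is rejected:

```python
# Toy model of static-membership fencing (illustrative names only, not
# Kafka's implementation): the coordinator tracks one member ID per
# group.instance.id, and a mismatching request is fenced.

class ToyCoordinator:
    def __init__(self):
        # group.instance.id -> member.id, as replayed from group metadata
        self.static_members = {}

    def load_member(self, instance_id, member_id):
        """Simulates loading group metadata during coordinator migration."""
        self.static_members[instance_id] = member_id

    def heartbeat(self, instance_id, member_id):
        known = self.static_members.get(instance_id)
        if known is not None and known != member_id:
            return "FENCED_INSTANCE_ID"  # consumer thread would shut down
        return "NONE"

# A new coordinator that only finished loading an older generation holds
# a stale member ID for the instance ...
coord = ToyCoordinator()
coord.load_member("kafka-streams-host-2", "member-old")
# ... and fences the live consumer heartbeating with its current ID.
assert coord.heartbeat("kafka-streams-host-2", "member-new") == "FENCED_INSTANCE_ID"
assert coord.heartbeat("kafka-streams-host-2", "member-old") == "NONE"
```

If the coordinator had finished loading the latest generation before serving requests, the stored and presented member IDs would match and no fencing would occur.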

Major timeline in the attached logs:
 * 07:27:04 - 07:32:25 UTC: Deployment to broker kafka-broker-55002.
 * 07:32:07 - 07:32:10 UTC: 36 unavailable broker messages seen for 
kafka-broker-55002 and 3 for kafka-broker-50003 (overlapping in time).
 * 07:32:21,505 UTC: Controller kafka-broker-51005 adds broker 
kafka-broker-55002 back as a live broker.
 * 07:32:23,276 UTC: kafka-broker-50003 last mention of loading membership data.
 * 07:32:34,175 - 07:32:34,278 UTC: kstream app threads (consumers in the 
group) using the latest member IDs interact with group coordinator 
kafka-broker-50003, which holds outdated member IDs. Consumer threads from the 
kstream app (river.stream-filter) are fenced and ejected from the consumer group.

When kafka-broker-50003 was assigned as group coordinator, it was loading the 
latest (737) generation of membership metadata, but appears to have never 
finished. A review of a previous successful migration shows multiple log lines 
for a new generation, but only one was seen in this deployment. Below are some 
log lines I found in a kstream host's kafka.log that highlight the member IDs 
for generation 737:

The first and last "Loaded member" log lines seen while the app was talking to 
group coordinator kafka-broker-50003:
{quote}
[2023-08-04 07:32:16,564] INFO Loaded member 
MemberMetadata(memberId=kafka-streams-55007.somewhere.com-2-8a5b4e68-7065-4125-9ba0-615b538a0e79,
 groupInstanceId=Some(kafka-streams-55007.somewhere.com-2), 
clientId=river.stream-filter.v1-d3473182-6af5-410d-82ef-4b85ea1b9a66-StreamThread-2-consumer,
 clientHost=/XX.XX.XX.XX, sessionTimeoutMs=45000, rebalanceTimeoutMs=30, 
supportedProtocols=List(stream)) in group river.stream-filter.v1 with 
generation 686. (kafka.coordinator.group.GroupMetadata$) 

...

[2023-08-04 07:32:23,276] INFO Loaded member 
MemberMetadata(memberId=kafka-streams-55007.somewhere.com-2-8a5b4e68-7065-4125-9ba0-615b538a0e79,
 groupInstanceId=Some(kafka-streams-55007.somewhere.com-2), 
clientId=river.stream-filter.v1-d3473182-6af5-410d-82ef-4b85ea1b9a66-StreamThread-2-consumer,
 clientHost=/XX.XX.XX.XX, sessionTimeoutMs=45000, rebalanceTimeoutMs=30, 
supportedProtocols=List(stream)) in group river.stream-filter.v1 with 
generation 737. (kafka.coordinator.group.GroupMetadata$)
{quote}
The memberId was the same in both generations (ending with {{615b538a0e79}}).

The consumers got fenced after the controller sent a message around the time of 
the above log line.

Later, kafka-broker-55002 gets the group coordinator assignment back. These log 
lines look similar to a previous successful migration, and the memberId changes 
(from ending with {{615b538a0e79}} to {{4ba4c7f17eea}}).

 
{quote}{{[2023-08-04 07:32:44,388] INFO Loaded member 
MemberMetadata(memberId=kafka-streams-55007.somewhere.com-2-8a5b4e68-7065-4125-9ba0-615b538a0e79,
 groupInstanceId=Some(kafka-streams-55007.somewhere.com-2), 
clientId=river.stream-filter.v1-d3473182-6af5-410d-82ef-4b85ea1b9a66-StreamThread-2-consumer,
 clientHost=/XX.XX.XX.XX, sessionTimeoutMs=45000, rebalanceTimeoutMs=30, 
supportedProtocols=List(stream)) in group river.stream-filter.v1 with 
generation 686. (kafka.coordinator.group.GroupMetadata$)}}

{{...}}

{{[2023-08-04 07:33:08,824] INFO Loaded member 

[DISCUSS] KIP-1004: Enforce tasks.max property in Kafka Connect

2023-11-10 Thread Chris Egerton
Hi all,

I'd like to open up KIP-1004 for discussion:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1004%3A+Enforce+tasks.max+property+in+Kafka+Connect

As a brief summary: this KIP proposes that the Kafka Connect runtime start
failing connectors that generate a greater number of tasks than the
tasks.max property, with an optional emergency override that can be used to
continue running these (probably-buggy) connectors if absolutely necessary.
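The proposed enforcement can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Connect runtime API; the function name and the override flag are invented for the sketch:

```python
# Hedged sketch of the enforcement KIP-1004 proposes (names are
# illustrative, not the Connect runtime's API): reject a connector whose
# generated task configs exceed tasks.max, unless an override is set.

def validate_task_configs(task_configs, tasks_max, enforce=True):
    """Return the configs to run, or raise if the connector over-generates."""
    if len(task_configs) <= tasks_max:
        return task_configs
    if not enforce:  # emergency override: keep running the buggy connector
        return task_configs
    raise ValueError(
        f"Connector generated {len(task_configs)} task configs, "
        f"but tasks.max is {tasks_max}"
    )

configs = [{"task": i} for i in range(5)]
assert len(validate_task_configs(configs, tasks_max=8)) == 5
assert len(validate_task_configs(configs, tasks_max=3, enforce=False)) == 5
try:
    validate_task_configs(configs, tasks_max=3)
except ValueError:
    pass  # the runtime would mark the connector as FAILED
```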

I'll be taking time off most of the next three weeks, so response latency
may be a bit higher than usual, but I wanted to kick off the discussion in
case we can land this in time for the upcoming 3.7.0 release.

Cheers,

Chris


Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.6 #110

2023-11-10 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 309263 lines...]

Gradle Test Run :core:test > Gradle Test Executor 16 > 
LogCleanerIntegrationTest > testMaxLogCompactionLag() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 25 > 
ConsistencyVectorIntegrationTest > shouldHaveSamePositionBoundActiveAndStandBy 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 25 > 
GlobalKTableEOSIntegrationTest > [exactly_once] > 
shouldKStreamGlobalKTableLeftJoin[exactly_once] STARTED

Gradle Test Run :core:test > Gradle Test Executor 16 > 
LogCleanerIntegrationTest > testMaxLogCompactionLag() PASSED

> Task :tools:test

Gradle Test Run :tools:test > Gradle Test Executor 24 > ConnectPluginPathTest > 
testListOnePluginPath(PluginLocationType) > [1] CLASS_HIERARCHY PASSED

Gradle Test Run :tools:test > Gradle Test Executor 24 > ConnectPluginPathTest > 
testListOnePluginPath(PluginLocationType) > [2] SINGLE_JAR STARTED

Gradle Test Run :tools:test > Gradle Test Executor 24 > ConnectPluginPathTest > 
testListOnePluginPath(PluginLocationType) > [2] SINGLE_JAR PASSED

Gradle Test Run :tools:test > Gradle Test Executor 24 > ConnectPluginPathTest > 
testListOnePluginPath(PluginLocationType) > [3] MULTI_JAR STARTED

Gradle Test Run :core:test > Gradle Test Executor 16 > 
LogCleanerLagIntegrationTest > cleanerTest(CompressionType) > [1] codec=none 
STARTED

> Task :tools:test

Gradle Test Run :tools:test > Gradle Test Executor 26 > 
DeleteRecordsCommandTest > testCommand() > [1] Type=Raft-Isolated, 
Name=testCommand, MetadataVersion=3.6-IV2, Security=PLAINTEXT STARTED

Gradle Test Run :tools:test > Gradle Test Executor 26 > 
DeleteRecordsCommandTest > testCommand() > [1] Type=Raft-Isolated, 
Name=testCommand, MetadataVersion=3.6-IV2, Security=PLAINTEXT PASSED

Gradle Test Run :tools:test > Gradle Test Executor 26 > 
DeleteRecordsCommandTest > testCommand() > [2] Type=Raft-Combined, 
Name=testCommand, MetadataVersion=3.6-IV2, Security=PLAINTEXT STARTED

Gradle Test Run :tools:test > Gradle Test Executor 26 > 
DeleteRecordsCommandTest > testCommand() > [2] Type=Raft-Combined, 
Name=testCommand, MetadataVersion=3.6-IV2, Security=PLAINTEXT PASSED

Gradle Test Run :tools:test > Gradle Test Executor 26 > 
DeleteRecordsCommandTest > testCommand() > [3] Type=ZK, Name=testCommand, 
MetadataVersion=3.6-IV2, Security=PLAINTEXT STARTED

Gradle Test Run :tools:test > Gradle Test Executor 25 > ConnectPluginPathTest > 
testSyncManifestsDryRun(PluginLocationType) > [2] SINGLE_JAR PASSED

Gradle Test Run :tools:test > Gradle Test Executor 25 > ConnectPluginPathTest > 
testSyncManifestsDryRun(PluginLocationType) > [3] MULTI_JAR STARTED

Gradle Test Run :tools:test > Gradle Test Executor 26 > 
DeleteRecordsCommandTest > testCommand() > [3] Type=ZK, Name=testCommand, 
MetadataVersion=3.6-IV2, Security=PLAINTEXT PASSED

Gradle Test Run :tools:test > Gradle Test Executor 26 > EndToEndLatencyTest > 
shouldFailWhenSentIsNotEqualToReceived() STARTED

Gradle Test Run :tools:test > Gradle Test Executor 24 > ConnectPluginPathTest > 
testListOnePluginPath(PluginLocationType) > [3] MULTI_JAR PASSED

Gradle Test Run :tools:test > Gradle Test Executor 24 > ConnectPluginPathTest > 
testListMultiplePluginPaths(PluginLocationType) > [1] CLASS_HIERARCHY STARTED

Gradle Test Run :tools:test > Gradle Test Executor 26 > EndToEndLatencyTest > 
shouldFailWhenSentIsNotEqualToReceived() PASSED

Gradle Test Run :tools:test > Gradle Test Executor 26 > EndToEndLatencyTest > 
shouldFailWhenConsumerRecordsIsEmpty() STARTED

Gradle Test Run :tools:test > Gradle Test Executor 26 > EndToEndLatencyTest > 
shouldFailWhenConsumerRecordsIsEmpty() PASSED

Gradle Test Run :tools:test > Gradle Test Executor 26 > EndToEndLatencyTest > 
shouldFailWhenReceivedMoreThanOneRecord() STARTED

Gradle Test Run :tools:test > Gradle Test Executor 26 > EndToEndLatencyTest > 
shouldFailWhenReceivedMoreThanOneRecord() PASSED

Gradle Test Run :tools:test > Gradle Test Executor 26 > EndToEndLatencyTest > 
shouldFailWhenProducerAcksAreNotSynchronised() STARTED

Gradle Test Run :tools:test > Gradle Test Executor 26 > EndToEndLatencyTest > 
shouldFailWhenProducerAcksAreNotSynchronised() PASSED

Gradle Test Run :tools:test > Gradle Test Executor 26 > EndToEndLatencyTest > 
shouldPassInValidation() STARTED

Gradle Test Run :tools:test > Gradle Test Executor 26 > EndToEndLatencyTest > 
shouldPassInValidation() PASSED

Gradle Test Run :tools:test > Gradle Test Executor 26 > EndToEndLatencyTest > 
shouldFailWhenSuppliedUnexpectedArgs() STARTED

Gradle Test Run :tools:test > Gradle Test Executor 26 > EndToEndLatencyTest > 
shouldFailWhenSuppliedUnexpectedArgs() PASSED

Gradle Test Run :tools:test > Gradle Test Executor 26 > FeatureCommandUnitTest 
> testHandleDowngrade() STARTED

Gradle Test Run :tools:test > Gradle Test Executor 26 > FeatureCommandUnitTest 
> testHandleDowngrade() 

Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2376

2023-11-10 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 216969 lines...]
> Task :connect:api:copyDependantLibs UP-TO-DATE
> Task :connect:api:jar UP-TO-DATE
> Task :connect:json:compileTestJava UP-TO-DATE
> Task :connect:api:generateMetadataFileForMavenJavaPublication
> Task :connect:json:copyDependantLibs UP-TO-DATE
> Task :connect:api:compileTestJava UP-TO-DATE
> Task :connect:json:jar UP-TO-DATE
> Task :connect:api:testClasses UP-TO-DATE
> Task :connect:json:generateMetadataFileForMavenJavaPublication
> Task :connect:json:testClasses UP-TO-DATE
> Task :connect:json:testJar
> Task :connect:json:testSrcJar
> Task :connect:api:testJar
> Task :connect:api:testSrcJar
> Task :connect:api:publishMavenJavaPublicationToMavenLocal
> Task :connect:api:publishToMavenLocal
> Task :connect:json:publishMavenJavaPublicationToMavenLocal
> Task :connect:json:publishToMavenLocal
> Task :storage:storage-api:compileTestJava
> Task :storage:storage-api:testClasses
> Task :server-common:compileTestJava
> Task :server-common:testClasses
> Task :raft:compileTestJava
> Task :raft:testClasses
> Task :group-coordinator:compileTestJava
> Task :group-coordinator:testClasses

> Task :clients:javadoc
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk/clients/src/main/java/org/apache/kafka/clients/admin/ScramMechanism.java:32:
 warning - Tag @see: missing final '>': "https://cwiki.apache.org/confluence/display/KAFKA/KIP-554%3A+Add+Broker-side+SCRAM+Config+API;>KIP-554:
 Add Broker-side SCRAM Config API

 This code is duplicated in 
org.apache.kafka.common.security.scram.internals.ScramMechanism.
 The type field in both files must match and must not change. The type field
 is used both for passing ScramCredentialUpsertion and for the internal
 UserScramCredentialRecord. Do not change the type field."
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/secured/package-info.java:21:
 warning - Tag @link: reference not found: 
org.apache.kafka.common.security.oauthbearer
2 warnings

> Task :clients:javadocJar
> Task :metadata:compileTestJava
> Task :metadata:testClasses
> Task :clients:srcJar
> Task :clients:testJar
> Task :clients:testSrcJar
> Task :clients:publishMavenJavaPublicationToMavenLocal
> Task :clients:publishToMavenLocal
> Task :core:classes
> Task :core:compileTestJava NO-SOURCE
> Task :core:compileTestScala
> Task :core:testClasses
> Task :streams:compileTestJava
> Task :streams:testClasses
> Task :streams:testJar
> Task :streams:testSrcJar
> Task :streams:publishMavenJavaPublicationToMavenLocal
> Task :streams:publishToMavenLocal

Deprecated Gradle features were used in this build, making it incompatible with 
Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings 
and determine if they come from your own scripts or plugins.

For more on this, please refer to 
https://docs.gradle.org/8.3/userguide/command_line_interface.html#sec:command_line_warnings
 in the Gradle documentation.

BUILD SUCCESSFUL in 8m 5s
93 actionable tasks: 40 executed, 53 up-to-date

Publishing build scan...
https://ge.apache.org/s/nud6uwb6c3ary

[Pipeline] sh
+ grep ^version= gradle.properties
+ cut -d= -f 2
[Pipeline] dir
Running in 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk/streams/quickstart
[Pipeline] {
[Pipeline] sh
+ mvn clean install -Dgpg.skip
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Kafka Streams :: Quickstart[pom]
[INFO] streams-quickstart-java[maven-archetype]
[INFO] 
[INFO] < org.apache.kafka:streams-quickstart >-
[INFO] Building Kafka Streams :: Quickstart 3.7.0-SNAPSHOT[1/2]
[INFO]   from pom.xml
[INFO] [ pom ]-
[INFO] 
[INFO] --- clean:3.0.0:clean (default-clean) @ streams-quickstart ---
[INFO] 
[INFO] --- remote-resources:1.5:process (process-resource-bundles) @ 
streams-quickstart ---
[INFO] 
[INFO] --- site:3.5.1:attach-descriptor (attach-descriptor) @ 
streams-quickstart ---
[INFO] 
[INFO] --- gpg:1.6:sign (sign-artifacts) @ streams-quickstart ---
[INFO] 
[INFO] --- install:2.5.2:install (default-install) @ streams-quickstart ---
[INFO] Installing 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk/streams/quickstart/pom.xml
 to 
/home/jenkins/.m2/repository/org/apache/kafka/streams-quickstart/3.7.0-SNAPSHOT/streams-quickstart-3.7.0-SNAPSHOT.pom
[INFO] 
[INFO] --< org.apache.kafka:streams-quickstart-java >--
[INFO] Building streams-quickstart-java 3.7.0-SNAPSHOT[2/2]
[INFO]   from java/pom.xml
[INFO] --[ maven-archetype ]---

Jenkins build is unstable: Kafka » Kafka Branch Builder » 3.5 #97

2023-11-10 Thread Apache Jenkins Server
See 




[jira] [Created] (KAFKA-15807) Add support for compress/decompress metrics

2023-11-10 Thread Apoorv Mittal (Jira)
Apoorv Mittal created KAFKA-15807:
-

 Summary: Add support for compress/decompress metrics
 Key: KAFKA-15807
 URL: https://issues.apache.org/jira/browse/KAFKA-15807
 Project: Kafka
  Issue Type: Sub-task
Reporter: Apoorv Mittal
Assignee: Apoorv Mittal






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Apache Kafka 3.5.2 release

2023-11-10 Thread Greg Harris
Hey Luke,

I merged and backported KAFKA-15800 to 3.5 and 3.6, sorry for the delay.

Thanks for running the release!
Greg

On Thu, Nov 9, 2023 at 6:54 PM Luke Chen  wrote:
>
> Hi all,
>
> Greg found a regression issue in Kafka connect:
> https://issues.apache.org/jira/browse/KAFKA-15800
> I'll wait until this fix gets merged and create CR build for v3.5.2.
>
> Thanks.
> Luke
>
> On Sat, Nov 4, 2023 at 1:33 AM Matthias J. Sax  wrote:
>
> > Hey,
> >
> > Sorry for late reply. We finished our testing, and think we are go.
> >
> > Thanks for giving us the opportunity to get the RocksDB version bump in.
> > Let's ship it!
> >
> >
> > -Matthias
> >
> > On 11/2/23 4:37 PM, Luke Chen wrote:
> > > Hi Matthias,
> > >
> > > Is there any update about the test status for RocksDB versions bumps?
> > > Could I create a 3.5.2 RC build next week?
> > >
> > > Thanks.
> > > Luke
> > >
> > > On Sat, Oct 21, 2023 at 1:01 PM Luke Chen  wrote:
> > >
> > >> Hi Matthias,
> > >>
> > >> I agree it's indeed a blocker for 3.5.2 to address CVE in RocksDB.
> > >> Please let me know when the test is completed.
> > >>
> > >> Thank you.
> > >> Luke
> > >>
> > >> On Sat, Oct 21, 2023 at 2:12 AM Matthias J. Sax 
> > wrote:
> > >>
> > >>> Thanks for the info Luke.
> > >>>
> > >>> We did backport all but one PR in the mean time. The missing PR is a
> > >>> RocksDB version bump. We want to consider it for 3.5.2, because it
> > >>> addresses a CVE.
> > >>>
> > >>> Cf https://github.com/apache/kafka/pull/14216
> > >>>
> > >>> However, RocksDB versions bumps are a little bit more tricky, and we
> > >>> would like to test this properly on 3.5 branch, what would take at
> > least
> > >>> one week; we could do the cherry-pick on Monday and start testing.
> > >>>
> > >>> Please let us know if such a delay for 3.5.2 is acceptable or not.
> > >>>
> > >>> Thanks.
> > >>>
> > >>> -Matthias
> > >>>
> > >>>
> > >>> On 10/20/23 5:44 AM, Luke Chen wrote:
> >  Hi Ryan,
> > 
> >  OK, I've backported it to 3.5 branch.
> >  It'll be included in v3.5.2.
> > 
> >  Thanks.
> >  Luke
> > 
> >  On Fri, Oct 20, 2023 at 7:43 AM Ryan Leslie (BLP/ NEW YORK (REMOT) <
> >  rles...@bloomberg.net> wrote:
> > 
> > > Hi Luke,
> > >
> > > Hope you are well. Can you please include
> > > https://issues.apache.org/jira/browse/KAFKA-15106 in 3.5.2?
> > >
> > > Thanks,
> > >
> > > Ryan
> > >
> > > From: dev@kafka.apache.org At: 10/17/23 05:05:24 UTC-4:00
> > > To: dev@kafka.apache.org
> > > Subject: Re: [DISCUSS] Apache Kafka 3.5.2 release
> > >
> > > Thanks Luke for volunteering for 3.5.2 release.
> > >
> > > On Tue, 17 Oct 2023 at 11:58, Josep Prat  > >
> > > wrote:
> > >>
> > >> Hi Luke,
> > >>
> > >> Thanks for taking this one!
> > >>
> > >> Best,
> > >>
> > >> On Tue, Oct 17, 2023 at 8:12 AM Luke Chen 
> > wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> I'd like to volunteer as release manager for the Apache Kafka
> > 3.5.2,
> > >>> to
> > >>> have an important bug/vulnerability fix release for 3.5.1.
> > >>>
> > >>> If there are no objections, I'll start building a release plan in
> > > thewiki
> > >>> in the next couple of weeks.
> > >>>
> > >>> Thanks,
> > >>> Luke
> > >>>
> > >>
> > >>
> > >> --
> > >> [image: Aiven] 
> > >>
> > >> *Josep Prat*
> > >> Open Source Engineering Director, *Aiven*
> > >> josep.p...@aiven.io | +491715557497
> > >> aiven.io  | <
> > >>> https://www.facebook.com/aivencloud>
> > >>  <
> > >>> https://twitter.com/aiven_io>
> > >> *Aiven Deutschland GmbH*
> > >> Alexanderufer 3-7, 10117 Berlin
> > >> Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> > >> Amtsgericht Charlottenburg, HRB 209739 B
> > >
> > >
> > >
> > 
> > >>>
> > >>
> > >
> >


[DISCUSS] KIP-1003: Signal next segment when remote fetching

2023-11-10 Thread Jorge Esteban Quilcate Otoya
Hi there,

I would like to start the discussion on a KIP for Tiered Storage. It's
about improving cross-segment latencies by enabling Remote Storage Manager
implementation to pre-fetch across segments.
Have a look:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-1003%3A+Signal+next+segment+when+remote+fetching
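The pre-fetching idea can be sketched as follows. This is an illustrative model only, not the interface proposed in the KIP; the class and method names are assumptions:

```python
# Illustrative sketch of signaling the next segment on a remote fetch,
# so a Remote Storage Manager implementation can warm a cache ahead of
# the cross-segment read (names are assumptions, not the KIP's API).

class PrefetchingRemoteStorage:
    def __init__(self, fetch_fn):
        self._fetch = fetch_fn          # actually reads remote storage
        self._cache = {}                # segment id -> bytes

    def fetch_segment(self, segment_id, next_segment_id=None):
        data = self._cache.pop(segment_id, None) or self._fetch(segment_id)
        if next_segment_id is not None:
            # A real implementation would do this asynchronously.
            self._cache[next_segment_id] = self._fetch(next_segment_id)
        return data

calls = []
store = PrefetchingRemoteStorage(lambda sid: calls.append(sid) or f"data-{sid}")
store.fetch_segment(1, next_segment_id=2)   # fetches 1, pre-fetches 2
assert store.fetch_segment(2) == "data-2"   # served from the warm cache
assert calls == [1, 2]                      # segment 2 was read only once
```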

Cheers,
Jorge


[DISCUSS] KIP-1002: Fetch remote segment indexes at once

2023-11-10 Thread Jorge Esteban Quilcate Otoya
Hello everyone,

I would like to start the discussion on a KIP for Tiered Storage. It's
about improving cross-segment latencies by reducing calls to fetch indexes
individually.
Have a look:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-1002%3A+Fetch+remote+segment+indexes+at+once
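A rough sketch of the difference follows. It is illustrative only; the actual interface is defined in the KIP, and all names here are assumptions:

```python
# Illustrative sketch: fetching each segment index (offset, timestamp,
# transaction, ...) in its own remote call versus one batched call.

INDEX_TYPES = ["offset", "timestamp", "transaction"]

def fetch_indexes_individually(remote_call, segment_id):
    # Current behavior: one remote round trip per index type.
    return {t: remote_call(segment_id, [t])[t] for t in INDEX_TYPES}

def fetch_indexes_at_once(remote_call, segment_id):
    # Proposed behavior: a single round trip for all indexes.
    return remote_call(segment_id, INDEX_TYPES)

calls = []
def remote_call(segment_id, index_types):
    calls.append(index_types)
    return {t: f"{segment_id}-{t}" for t in index_types}

fetch_indexes_individually(remote_call, 7)
assert len(calls) == 3          # three round trips to remote storage
calls.clear()
fetch_indexes_at_once(remote_call, 7)
assert len(calls) == 1          # a single round trip
```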

Cheers,
Jorge


[jira] [Created] (KAFKA-15806) Signal next segment when remote fetching

2023-11-10 Thread Jorge Esteban Quilcate Otoya (Jira)
Jorge Esteban Quilcate Otoya created KAFKA-15806:


 Summary: Signal next segment when remote fetching 
 Key: KAFKA-15806
 URL: https://issues.apache.org/jira/browse/KAFKA-15806
 Project: Kafka
  Issue Type: Improvement
  Components: Tiered-Storage
Reporter: Jorge Esteban Quilcate Otoya
Assignee: Jorge Esteban Quilcate Otoya


Improve remote fetching performance when fetching across segments by signaling 
the next segment, allowing Remote Storage Manager implementations to optimize 
their pre-fetching.





[jira] [Created] (KAFKA-15805) Fetch Remote Indexes at once

2023-11-10 Thread Jorge Esteban Quilcate Otoya (Jira)
Jorge Esteban Quilcate Otoya created KAFKA-15805:


 Summary: Fetch Remote Indexes at once
 Key: KAFKA-15805
 URL: https://issues.apache.org/jira/browse/KAFKA-15805
 Project: Kafka
  Issue Type: Improvement
  Components: Tiered-Storage
Reporter: Jorge Esteban Quilcate Otoya
Assignee: Jorge Esteban Quilcate Otoya


Reduce Tiered Storage latency when fetching indexes by allowing many indexes to 
be fetched at once.

 





Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2375

2023-11-10 Thread Apache Jenkins Server
See 




[jira] [Resolved] (KAFKA-15745) KRaft support in RequestQuotaTest

2023-11-10 Thread Zihao Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zihao Lin resolved KAFKA-15745.
---
Resolution: Duplicate

> KRaft support in RequestQuotaTest
> -
>
> Key: KAFKA-15745
> URL: https://issues.apache.org/jira/browse/KAFKA-15745
> Project: Kafka
>  Issue Type: Task
>  Components: core
>Reporter: Sameer Tejani
>Priority: Minor
>  Labels: kraft, kraft-test, newbie
>
> The following tests in RequestQuotaTest in 
> core/src/test/scala/unit/kafka/server/RequestQuotaTest.scala need to be 
> updated to support KRaft
> 132 : def testResponseThrottleTime(): Unit = {
> 140 : def testResponseThrottleTimeWhenBothProduceAndRequestQuotasViolated(): 
> Unit = {
> 146 : def testResponseThrottleTimeWhenBothFetchAndRequestQuotasViolated(): 
> Unit = {
> 152 : def testUnthrottledClient(): Unit = {
> 161 : def testExemptRequestTime(): Unit = {
> 171 : def testUnauthorizedThrottle(): Unit = {
> Scanned 801 lines. Found 0 KRaft tests out of 6 tests





Re: [DISCUSS] KIP-977: Partition-Level Throughput Metrics

2023-11-10 Thread Qichao Chu
Thank you again for the nice suggestions, Jorge!
I will wait for Divij's response and move it to the vote stage once the
generic filter part reaches consensus.
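The verbosity-filter evaluation being discussed can be sketched as follows. This is an illustrative model only; the actual JSON schema and matching semantics are defined in KIP-977, and the keys used here (`level`, `filters`, `key`, `value`) are assumptions:

```python
# Toy evaluation of a metrics.verbosity-style JSON config (keys and
# semantics are assumptions, not KIP-977's schema): an entry's filters
# must all match a metric for its level to apply.

import json

config = json.loads("""
[
  {"level": "high", "filters": [{"key": "name", "value": "MessagesInPerSec"}]}
]
""")

def verbosity_for(metric, entries):
    for entry in entries:
        if all(metric.get(f["key"]) == f["value"] for f in entry["filters"]):
            return entry["level"]
    return "low"  # default when no entry matches

assert verbosity_for({"name": "MessagesInPerSec"}, config) == "high"
assert verbosity_for({"name": "BytesOutPerSec"}, config) == "low"
```

Because entries are required to be non-conflicting, at most one entry's level applies to a given metric, so evaluation order does not matter.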

Qichao Chu
Software Engineer | Data - Kafka
[image: Uber] 


On Fri, Nov 10, 2023 at 6:49 AM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Hi Qichao,
>
> Thanks for updating the KIP, all updates look good to me.
>
> Looking forward to see this KIP moving forward!
>
> Cheers,
> Jorge.
>
>
>
> On Wed, 8 Nov 2023 at 08:55, Qichao Chu  wrote:
>
> > Hi Divij,
> >
> > Thank you for the feedback. I updated the KIP to make it a little bit
> > more generic: filters will stay in an array instead of different
> > top-level objects. In this way, new filter types (such as language
> > filters) can be added in the future if needed. The logical relationship
> > of the filters is also added.
> >
> > Hi Jorge,
> >
> > Thank you for the review and great comments. Here is the reply for each
> of
> > the suggestions:
> >
> > 1) The words describing the property are now updated to include more
> > details of the keys in the JSON. It also explicitly mentions the JSON
> > nature of the config now.
> > 2) The JSON entries should not conflict, so the order is not relevant.
> > If there's a conflict, the conflict resolution rules are stated in the
> > KIP. To
> > make it more clear, ordering and duplication rules are updated in the
> > Restrictions section of the *level* property.
> > 3) Yeah we did take a look at the RecordingLevel config and it does not
> > work for this case. The RecordingLevel config does not offer the
> > capability of filtering, and it has the drawback of needing to be
> > added to all future
> > sensors. To reduce the duplication, I propose we merge the RecordingLevel
> > to this more generic config in the future. Please take a look into the
> > *Using
> > the Existing RecordingLevel Config* section under *Rejected Alternatives*
> > for more details.
> > 4) This suggestion makes a lot of sense. My idea is to create a
> > table/form/doc in the documentation for the verbosity levels of all
> metric
> > series. If it's too verbose to be in the docs, I will update the KIP to
> > include this info. I will create a JIRA for this effort once the KIP is
> > approved.
> > 5) Sure we can expand to all other series, added to the KIP.
> > 6) Added a new section(*Working with the Configuration via CLI)* with the
> > user experience details
> > 7) Links are updated.
> >
> > Please take another look and let me know if you have any more concerns.
> >
> > Best,
> > Qichao Chu
> > Software Engineer | Data - Kafka
> > [image: Uber] 
> >
> >
> > On Wed, Nov 8, 2023 at 6:29 AM Jorge Esteban Quilcate Otoya <
> > quilcate.jo...@gmail.com> wrote:
> >
> > > Hi Qichao,
> > >
> > > Thanks for the KIP! This will be a valuable contribution and improve
> the
> > > tooling for troubleshooting.
> > >
> > > I have a couple of comments:
> > >
> > > 1. It's unclear from the `metrics.verbosity` description what the
> > supported
> > > values are. In the description mentions "If the value is high ... In
> the
> > > low settings" but I think it's referring to the `level` property
> > > specifically instead of the whole value that is now JSON. Could you
> > clarify
> > > this?
> > >
> > > 2. Could we state in which order the JSON entries are going to be
> > > evaluated? I guess the last entry wins if it overlaps previous values,
> > but
> > > better to make this explicit.
> > >
> > > 3. Kafka metrics library has a `RecordingLevel` configuration -- have
> we
> > > considered aligning these concepts and maybe reuse it instead of
> > > `verbosityLevel`? Then we can reuse the levels: INFO, DEBUG, TRACE.
> > >
> > > 4. Not sure if within the scope of the KIP, but would be helpful to
> > > document the metrics with the verbosity level attached to the metrics.
> > > Maybe creating a JIRA ticket to track this would be enough if we can't
> > > cover it as part of the KIP.
> > >
> > > 5. Could we consider the following client-related metrics as well:
> > >   - BytesRejectedPerSec
> > >   - TotalProduceRequestsPerSec
> > >   - TotalFetchRequestsPerSec
> > >   - FailedProduceRequestsPerSec
> > >   - FailedFetchRequestsPerSec
> > >   - FetchMessageConversionsPerSec
> > >   - ProduceMessageConversionsPerSec
> > > Would be great to have these from day 1 instead of requiring a
> following
> > > KIP to extend this. Could be implemented in separate PRs if needed.
> > >
> > > 6. To make it clearer how the user experience would be, could we
> provide
> > an
> > > example of:
> > > - how the broker configuration will be provided by default, and
> > > - how the CLI tooling would be used to change the configuration?
> > > - Maybe a couple of scenarios: adding a new metric config, a second one
> > > with overlapping values, and
> > > - describing the expected metrics to be mapped
> > >
> > > A couple of nits:
> > > - The first link "MessagesInPerSec metrics" is pointing to
> > > 

Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2374

2023-11-10 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 326577 lines...]

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldSetUncaughtStreamsException() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldClearTaskTimeoutOnProcessed() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldClearTaskTimeoutOnProcessed() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldUnassignTaskWhenRequired() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldUnassignTaskWhenRequired() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldClearTaskReleaseFutureOnShutdown() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldClearTaskReleaseFutureOnShutdown() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldProcessTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldProcessTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldPunctuateStreamTime() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldPunctuateStreamTime() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldShutdownTaskExecutor() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldShutdownTaskExecutor() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > 
DefaultTaskExecutorTest > shouldAwaitProcessableTasksIfNoneAssignable() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldAwaitProcessableTasksIfNoneAssignable() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldRespectPunctuationDisabledByTaskExecutionMetadata() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldRespectPunctuationDisabledByTaskExecutionMetadata() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldSetTaskTimeoutOnTimeoutException() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldSetTaskTimeoutOnTimeoutException() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldPunctuateSystemTime() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldPunctuateSystemTime() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldUnassignTaskWhenNotProgressing() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldUnassignTaskWhenNotProgressing() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldNotFlushOnException() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldNotFlushOnException() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldRespectProcessingDisabledByTaskExecutionMetadata() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskExecutorTest > shouldRespectProcessingDisabledByTaskExecutionMetadata() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskManagerTest > shouldLockAnEmptySetOfTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskManagerTest > shouldLockAnEmptySetOfTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskManagerTest > shouldAssignTasksThatCanBeSystemTimePunctuated() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskManagerTest > shouldAssignTasksThatCanBeSystemTimePunctuated() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskManagerTest > shouldNotUnassignNotOwnedTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskManagerTest > shouldNotUnassignNotOwnedTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskManagerTest > shouldNotSetUncaughtExceptionsTwice() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskManagerTest > shouldNotSetUncaughtExceptionsTwice() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskManagerTest > shouldReturnFromAwaitOnAdding() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 88 > DefaultTaskManagerTest > shouldReturnFromAwaitOnAdding() PASSED


Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

2023-11-10 Thread Satish Duggana
Thanks Christo for the KIP and the interesting discussion.

101. Adding metrics at the partition level may increase the cardinality of
these metrics. We should be cautious of that and verify whether they are
really needed. Problems with RLM-related operations are generally not
specific to individual partitions; they are mostly caused by remote
storage or broker-level issues.

102. I am not sure whether the records metric is very useful when we
have the other bytes- and segments-related metrics available. If needed,
record-level information can be derived once we have the segments/bytes
metrics.

103. Regarding RemoteLogSizeComputationTime, we can add debug logs that
print the duration taken while computing the size instead of generating a
metric. If there is any degradation in remote log size computation, it
will affect the RLM task, leading to remote log copy and delete lags.

RLMM and RSM implementations can always add more metrics for
observability based on the respective implementations.

104. What is the purpose of RemoteLogMetadataCount as a metric?

Thanks,
Satish.

On Fri, 10 Nov 2023 at 04:10, Jorge Esteban Quilcate Otoya
 wrote:
>
> Hi Christo,
>
> I'd like to add another suggestion:
>
> 7. Adding on to the TS lag formulas, my understanding is that per partition:
> - RemoteCopyLag: difference between the latest local segment candidate for
> upload and the latest remote segment
>   - Represents how the Remote Log Manager task is handling the backlog of segments.
>   - Ideally, this lag is zero -- it grows when uploading is slower than the
> rate at which new candidate segments to upload appear
>
> - RemoteDeleteLag: difference between the latest remote candidate segment to
> keep based on retention and the oldest remote segment
>   - Represents how many segments the Remote Log Manager task still has to
> delete at a given point in time
>   - Ideally, this lag is zero -- it grows when the retention condition changes
> but the RLM task has not been able to schedule the deletion yet.
>
> Is my understanding of these lags correct?
>
> I'd like to also consider an additional lag:
> - LocalDeleteLag: difference between the latest local candidate segment to
> keep based on local retention and the oldest local segment
>   - Represents how many segments are still available locally when they are
> candidates for deletion. This usually happens when the log cleaner has not
> been scheduled yet. It's important because it shows how much data is stored
> locally when it could be removed, and how much data that could be sourced
> from the remote tier is still being sourced from the local tier.
>   - Ideally, this lag returns to zero when the log cleaner runs; but it could
> keep growing if there are issues uploading data (the other lag) or removing
> data internally.
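To make the three lag definitions above concrete, here is a rough sketch in Java of how they could be computed. All names here (the class and its parameters) are hypothetical and do not correspond to any actual Kafka API; the lags could equally be counted in segments rather than offsets:

```java
// Hypothetical sketch of the three tiered-storage lags discussed above.
// Each lag is a non-negative difference between a "target" position and the
// position the Remote Log Manager (RLM) task has actually reached.
final class TieredStorageLags {

    // RemoteCopyLag: latest local segment candidate for upload minus the
    // latest segment already uploaded to remote storage.
    static long remoteCopyLag(long latestLocalCandidateOffset, long latestRemoteOffset) {
        return Math.max(0L, latestLocalCandidateOffset - latestRemoteOffset);
    }

    // RemoteDeleteLag: latest remote segment that should be kept per retention
    // minus the oldest remote segment still present (i.e. pending deletions).
    static long remoteDeleteLag(long latestRemoteOffsetToKeep, long oldestRemoteOffset) {
        return Math.max(0L, latestRemoteOffsetToKeep - oldestRemoteOffset);
    }

    // LocalDeleteLag: latest local segment that should be kept per local
    // retention minus the oldest local segment still present.
    static long localDeleteLag(long latestLocalOffsetToKeep, long oldestLocalOffset) {
        return Math.max(0L, latestLocalOffsetToKeep - oldestLocalOffset);
    }
}
```

Under this reading, all three values return to zero once the RLM task (or the log cleaner, for the local lag) catches up, and a persistently growing value signals a stuck or slow task.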
>
> Thanks,
> Jorge.
>
> On Thu, 9 Nov 2023 at 10:51, Luke Chen  wrote:
>
> > Hi Christo,
> >
> > Thanks for the KIP!
> >
> > Some comments:
> > 1. I agree with Kamal that a metric to cover the time taken to read data
> > from remote storage is helpful.
> >
> > 2. I can see that some metrics are only at the topic level, while others
> > are at the partition level.
> > Could you explain why some of them are only at the topic level?
> > Take RemoteLogSizeComputationTime: it differs from partition to
> > partition, so would it be better to expose it as a partition-level metric?
> >
> > 3. `RemoteLogSizeBytes` metric hanging.
> > To compute RemoteLogSizeBytes, we need to wait until all records in the
> > metadata topic are loaded.
> > What will happen if it takes a long time to load the data from the metadata topic?
> > Should we instead return -1 or something similar to indicate it's still loading?
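A minimal sketch of the kind of guard Luke describes -- purely illustrative, not Kafka's actual metrics code -- is a gauge that reports -1 until the metadata topic has been fully consumed:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a gauge whose value is -1 while metadata is still loading,
// and the accumulated remote log size in bytes once loading completes.
final class RemoteLogSizeGauge {
    private final AtomicBoolean metadataLoaded = new AtomicBoolean(false);
    private final AtomicLong sizeBytes = new AtomicLong(0L);

    // Called for each segment discovered while replaying the metadata topic.
    void recordLoadedSegment(long segmentSizeBytes) {
        sizeBytes.addAndGet(segmentSizeBytes);
    }

    // Called once the metadata topic has been fully replayed.
    void markLoadingComplete() {
        metadataLoaded.set(true);
    }

    // Value handed to the metrics reporter: -1 means "still loading".
    long value() {
        return metadataLoaded.get() ? sizeBytes.get() : -1L;
    }
}
```

The advantage over blocking is that a dashboard or alert can distinguish "size unknown yet" from "size is zero" without the metrics thread hanging on metadata load.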
> >
> > Thanks.
> > Luke
> >
> > On Fri, Nov 3, 2023 at 1:53 AM Kamal Chandraprakash <
> > kamal.chandraprak...@gmail.com> wrote:
> >
> > > Hi Christo,
> > >
> > > Thanks for expanding the scope of the KIP! We should also cover the time
> > > taken to read data from remote storage. This will give our users a fair
> > > idea about the P99, P95, and P50 Fetch latency to read data from remote
> > > storage.
> > >
> > > The Fetch API request metrics currently provide a breakdown of the time
> > > spent on each item:
> > >
> > >
> > https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/RequestChannel.scala#L517
> > > Should we also provide a `RemoteStorageTimeMs` item (only for the FETCH
> > > API) so that users can understand the overall and per-step time taken?
> > >
> > > Regarding the Remote deletion metrics, should we also emit a metric to
> > > expose the oldest segment time?
> > > Users can configure the topic retention either by size or time. If time
> > > is configured, then emitting
> > > the oldest segment time allows the user to configure an alert on top of
> > > it and act accordingly.
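An alert on such a metric could compare the oldest remote segment's timestamp against the configured retention.ms. The sketch below is hypothetical -- the class and parameter names are illustrative, not part of any Kafka API:

```java
// Hypothetical alert condition: fire when the oldest remote segment is older
// than retention.ms by more than a tolerance, i.e. remote deletion is lagging.
final class OldestSegmentAlert {
    static boolean shouldAlert(long oldestSegmentTimestampMs,
                               long nowMs,
                               long retentionMs,
                               long toleranceMs) {
        long ageMs = nowMs - oldestSegmentTimestampMs;
        return ageMs > retentionMs + toleranceMs;
    }
}
```

With a tolerance of one RLM scheduling interval, this only fires when deletion has genuinely fallen behind rather than on the normal gap between a segment expiring and the next delete pass.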
> > >
> > > On Wed, Nov 1, 2023 at 7:07 PM Jorge Esteban Quilcate Otoya <
> > > quilcate.jo...@gmail.com> wrote:
> > >
> > > > Thanks, Christo!
> > > >
> > > > 1. Agree. Having a further look into how many latency metrics are
> > > included
> > > > on the broker side 

[PR] Update Satish added as a PMC member [kafka-site]

2023-11-10 Thread via GitHub


satishd opened a new pull request, #566:
URL: https://github.com/apache/kafka-site/pull/566

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org