date:20210903

[GitHub] [kafka] mimaison commented on pull request #11220: KAFKA-10777: Add additional configuration to control MirrorMaker 2 internal topics naming convention

2021-09-03 Thread GitBox



mimaison commented on pull request #11220:
URL: https://github.com/apache/kafka/pull/11220#issuecomment-912351237


   Thanks @OmniaGM for the PR. I'll try to take a look next week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] mimaison commented on a change in pull request #11220: KAFKA-10777: Add additional configuration to control MirrorMaker 2 internal topics naming convention

2021-09-03 Thread GitBox



mimaison commented on a change in pull request #11220:
URL: https://github.com/apache/kafka/pull/11220#discussion_r701694521



##
File path: 
connect/mirror-client/src/main/java/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.java
##
@@ -70,4 +70,32 @@ public String upstreamTopic(String topic) {
 return topic.substring(source.length() + separator.length());
 }
 }
+
+private String internalSuffix() {
+return separator + "internal";
+}
+
+private String checkpointTopicSufficx() {

Review comment:
   `checkpointTopicSufficx` -> `checkpointsTopicSuffix`
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] mimaison commented on a change in pull request #11288: MINOR: Fix error response generation

2021-09-03 Thread GitBox



mimaison commented on a change in pull request #11288:
URL: https://github.com/apache/kafka/pull/11288#discussion_r701702133



##
File path: 
clients/src/test/java/org/apache/kafka/common/requests/RequestResponseTest.java
##
@@ -706,6 +706,8 @@ private void checkDescribeConfigsResponseVersions() {
 private void checkErrorResponse(AbstractRequest req, Throwable e, boolean 
checkEqualityAndHashCode) {
 AbstractResponse response = req.getErrorResponse(e);
 checkResponse(response, req.version(), checkEqualityAndHashCode);
+Map errorCounts = response.errorCounts();
+assertTrue(errorCounts.containsKey(Errors.forException(e)), "API Key " 
+ req.apiKey().name + "V" + req.version() + " failed errorCounts test");

Review comment:
   @hachikuji Thanks for the review! I pushed an update




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Updated] (KAFKA-13247) Adding functionality for loading private key entry by alias from the keystore

2021-09-03 Thread Nikolay Izhikov (Jira)



 [ 
https://issues.apache.org/jira/browse/KAFKA-13247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolay Izhikov updated KAFKA-13247:

Labels: kip-required  (was: )

> Adding functionality for loading private key entry by alias from the keystore
> -
>
> Key: KAFKA-13247
> URL: https://issues.apache.org/jira/browse/KAFKA-13247
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Tigran Margaryan
>Priority: Major
>  Labels: kip-required
>
> Hello team,
> While configuring SSL for Kafka connectivity , I found out that there is no 
> possibility to choose/load the private key entry by alias from the keystore 
> defined via 
> org.apache.kafka.common.config.SslConfigs.SSL_KEYSTORE_LOCATION_CONFIG. It 
> turns out that the keystore could not have multiple private key entries .
> Kindly ask you to add that config (smth. like SSL_KEY_ALIAS_CONFIG) into 
> SslConfigs with the corresponding functionality which should load only the 
> private key entry by defined alias.
>  
> Thanks in advance. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Assigned] (KAFKA-13243) Differentiate metric latency measured in millis and nanos

2021-09-03 Thread Josep Prat (Jira)



 [ 
https://issues.apache.org/jira/browse/KAFKA-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josep Prat reassigned KAFKA-13243:
--

Assignee: Josep Prat

> Differentiate metric latency measured in millis and nanos
> -
>
> Key: KAFKA-13243
> URL: https://issues.apache.org/jira/browse/KAFKA-13243
> Project: Kafka
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Guozhang Wang
>Assignee: Josep Prat
>Priority: Major
>  Labels: needs-kip
>
> Today most of the client latency metrics are measured in millis, and some in 
> nanos. For those measured in nanos we usually differentiate them by having a 
> `-ns` suffix in the metric names, e.g. `io-wait-time-ns-avg` and 
> `io-time-ns-avg`. But there are a few that we obviously forgot to follow this 
> pattern, e.g. `io-wait-time-total`: it is inconsistent where `avg` has `-ns` 
> suffix and `total` has not. I did a quick search and found just three of them:
> * bufferpool-wait-time-total -> bufferpool-wait-time-ns-total
> * io-wait-time-total -> io-wait-time-ns-total
> * iotime-total -> io-time-ns-total (note that there are two inconsistencies 
> on naming, the average metric is `io-time-ns-avg` whereas total is 
> `iotime-total`, I suggest we use `io-time` instead of `iotime` for both).
> We should change their name accordingly with the `-ns` suffix as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (KAFKA-13243) Differentiate metric latency measured in millis and nanos

2021-09-03 Thread Josep Prat (Jira)



[ 
https://issues.apache.org/jira/browse/KAFKA-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409407#comment-17409407
 ] 

Josep Prat commented on KAFKA-13243:


I'll like to tackle this one

> Differentiate metric latency measured in millis and nanos
> -
>
> Key: KAFKA-13243
> URL: https://issues.apache.org/jira/browse/KAFKA-13243
> Project: Kafka
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Guozhang Wang
>Assignee: Josep Prat
>Priority: Major
>  Labels: needs-kip
>
> Today most of the client latency metrics are measured in millis, and some in 
> nanos. For those measured in nanos we usually differentiate them by having a 
> `-ns` suffix in the metric names, e.g. `io-wait-time-ns-avg` and 
> `io-time-ns-avg`. But there are a few that we obviously forgot to follow this 
> pattern, e.g. `io-wait-time-total`: it is inconsistent where `avg` has `-ns` 
> suffix and `total` has not. I did a quick search and found just three of them:
> * bufferpool-wait-time-total -> bufferpool-wait-time-ns-total
> * io-wait-time-total -> io-wait-time-ns-total
> * iotime-total -> io-time-ns-total (note that there are two inconsistencies 
> on naming, the average metric is `io-time-ns-avg` whereas total is 
> `iotime-total`, I suggest we use `io-time` instead of `iotime` for both).
> We should change their name accordingly with the `-ns` suffix as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [kafka] keepal7 commented on pull request #11280: MINOR:When rebalance times out, print the member's clientHost in the …

2021-09-03 Thread GitBox



keepal7 commented on pull request #11280:
URL: https://github.com/apache/kafka/pull/11280#issuecomment-912472160


   Who can tell me why the pipeline can't pass?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] dajac opened a new pull request #11294: KAFKA-13266; `InitialFetchState` should be created after partition is removed from the fetchers

2021-09-03 Thread GitBox



dajac opened a new pull request #11294:
URL: https://github.com/apache/kafka/pull/11294


    `ReplicationTest.test_replication_with_broker_failure` in KRaft mode 
sometimes fails with the following error in the log:
   
   ```
   [2021-08-31 11:31:25,092] ERROR [ReplicaFetcher replicaId=1, leaderId=2, 
fetcherId=0] Unexpected error occurred while processing data for partition 
__consumer_offsets-1 at offset 31727 
(kafka.server.ReplicaFetcherThread)java.lang.IllegalStateException: Offset 
mismatch for partition __consumer_offsets-1: fetched offset = 31727, log end 
offset = 31728. at 
kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:194)
 at 
kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$8(AbstractFetcherThread.scala:545)
 at scala.Option.foreach(Option.scala:437) at 
kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7(AbstractFetcherThread.scala:533)
 at 
kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7$adapted(AbstractFetcherThread.scala:532)
 at 
kafka.utils.Implicits$MapExtensionMethods$.$anonfun$forKeyValue$1(Implicits.scala:62)
 at 
scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry(JavaCollectionWrappers.scal
 a:359) at 
scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry$(JavaCollectionWrappers.scala:355)
 at 
scala.collection.convert.JavaCollectionWrappers$AbstractJMapWrapper.foreachEntry(JavaCollectionWrappers.scala:309)
 at 
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:532)
 at 
kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:216)
 at 
kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:215)
 at scala.Option.foreach(Option.scala:437) at 
kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:215) 
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:197) 
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:99)[2021-08-31 
11:31:25,093] WARN [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] 
Partition __consumer_offsets-1 marked as failed 
(kafka.server.ReplicaFetcherThread)
   ```
   
   The issue is due to a race condition in 
`ReplicaManager#applyLocalFollowersDelta`. The `InitialFetchState` is created 
and populated before the partition is removed from the fetcher threads. This 
means that the fetch offset of the `InitialFetchState` could be outdated when 
the fetcher threads are re-started because the fetcher threads could have 
incremented the log end offset in between.
   
   The patch fixes the issue by removing the partitions from the replica 
fetcher threads before creating the `InitialFetchState` for them.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] dajac commented on a change in pull request #11288: MINOR: Fix error response generation

2021-09-03 Thread GitBox



dajac commented on a change in pull request #11288:
URL: https://github.com/apache/kafka/pull/11288#discussion_r701855200



##
File path: 
clients/src/main/java/org/apache/kafka/common/requests/DescribeProducersRequest.java
##
@@ -69,7 +71,7 @@ public DescribeProducersRequestData data() {
 @Override
 public DescribeProducersResponse getErrorResponse(int throttleTimeMs, 
Throwable e) {
 Errors error = Errors.forException(e);
-DescribeProducersResponseData response = new 
DescribeProducersResponseData();
+List topics = new ArrayList<>();

Review comment:
   nit: We could keep instantiating `DescribeProducersResponseData` here 
and add the topic to it with `reponse.topics.add(..)`.

##
File path: 
clients/src/main/java/org/apache/kafka/common/requests/AlterClientQuotasRequest.java
##
@@ -125,7 +127,10 @@ public AlterClientQuotasResponse getErrorResponse(int 
throttleTimeMs, Throwable
 .setEntityType(entityData.entityType())
 .setEntityName(entityData.entityName()));
 }
-responseEntries.add(new 
AlterClientQuotasResponseData.EntryData().setEntity(responseEntities));
+responseEntries.add(new AlterClientQuotasResponseData.EntryData()
+.setEntity(responseEntities)
+.setErrorCode(error.code())
+.setErrorMessage(error.message()));

Review comment:
   nit: Should we use 4 spaces to indent those ones here? We use 4 spaces 
for similar code at L127/128 so the code would remain a bit more homogeneous. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] dajac commented on pull request #11086: KAFKA-13103: add REBALANCE_IN_PROGRESS error as retriable error

2021-09-03 Thread GitBox



dajac commented on pull request #11086:
URL: https://github.com/apache/kafka/pull/11086#issuecomment-912511738


   @showuon Yes. Could you update the description of the PR to reflect what has 
been done? The description is outdated. I will merge it after.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] showuon commented on pull request #11086: KAFKA-13103: add REBALANCE_IN_PROGRESS error as retriable error for AlterConsumerGroupOffsetsHandler

2021-09-03 Thread GitBox



showuon commented on pull request #11086:
URL: https://github.com/apache/kafka/pull/11086#issuecomment-912521901


   Thanks for reminding. Updated. Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] mimaison commented on pull request #11288: MINOR: Fix error response generation

2021-09-03 Thread GitBox



mimaison commented on pull request #11288:
URL: https://github.com/apache/kafka/pull/11288#issuecomment-912527113


   Thanks @dajac for the feedback. I've addressed your comments.
   
   I agree with your comment about testing of these classes. 
`RequestResponseTest` is messy and it's hard to tell what's actually tested. I 
wouldn't be surprized if some requests or responses are completely missed!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (KAFKA-13243) Differentiate metric latency measured in millis and nanos

2021-09-03 Thread Josep Prat (Jira)



[ 
https://issues.apache.org/jira/browse/KAFKA-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409511#comment-17409511
 ] 

Josep Prat commented on KAFKA-13243:


KIP can be found here 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-773%3A+Differentiate+consistently+metric+latency+measured+in+millis+and+nanos]

 

> Differentiate metric latency measured in millis and nanos
> -
>
> Key: KAFKA-13243
> URL: https://issues.apache.org/jira/browse/KAFKA-13243
> Project: Kafka
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Guozhang Wang
>Assignee: Josep Prat
>Priority: Major
>  Labels: needs-kip
>
> Today most of the client latency metrics are measured in millis, and some in 
> nanos. For those measured in nanos we usually differentiate them by having a 
> `-ns` suffix in the metric names, e.g. `io-wait-time-ns-avg` and 
> `io-time-ns-avg`. But there are a few that we obviously forgot to follow this 
> pattern, e.g. `io-wait-time-total`: it is inconsistent where `avg` has `-ns` 
> suffix and `total` has not. I did a quick search and found just three of them:
> * bufferpool-wait-time-total -> bufferpool-wait-time-ns-total
> * io-wait-time-total -> io-wait-time-ns-total
> * iotime-total -> io-time-ns-total (note that there are two inconsistencies 
> on naming, the average metric is `io-time-ns-avg` whereas total is 
> `iotime-total`, I suggest we use `io-time` instead of `iotime` for both).
> We should change their name accordingly with the `-ns` suffix as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (KAFKA-13271) Error while fetching metadata with correlation id 219783 : LEADER_NOT_AVAILABLE

2021-09-03 Thread Jira

Fátima Galera created KAFKA-13271:
-

 Summary: Error while fetching metadata with correlation id 219783 
: LEADER_NOT_AVAILABLE
 Key: KAFKA-13271
 URL: https://issues.apache.org/jira/browse/KAFKA-13271
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 2.2.2
 Environment: Production environment
Reporter: Fátima Galera
 Fix For: 2.2.3


Hi dear kafka support

 

We are getting below error after a new connector creation

 

[2021-09-02 19:15:23,878] WARN [Producer clientId=producer-178] Error while 
fetching metadata with correlation id 219783 : \{ 
tkr_prd2.tkr.glrep_file2=LEADER_NOT_AVAILABLE} 
(org.apache.kafka.clients.NetworkClient)
[2021-09-02 19:15:23,982] WARN [Producer clientId=producer-178] Error while 
fetching metadata with correlation id 219784 : \{ 
tkr_prd2.tkr.glrep_file2=LEADER_NOT_AVAILABLE} 
(org.apache.kafka.clients.NetworkClient)
[2021-09-02 19:15:24,095] ERROR WorkerSourceTask\{id=tkr_prd2-0} failed to send 
record to tkr_prd2.tkr.glrep_file2: {} (org.a 
pache.kafka.connect.runtime.WorkerSourceTask)
[2021-09-02 19:15:24,149] ERROR WorkerSourceTask\{id=tkr_prd2-0} failed to send 
record to tkr_prd2.tkr.glrep_file2: {} (org.a 
pache.kafka.connect.runtime.WorkerSourceTask)

 

We are not able to get the offset of this topic. We already changed from 

 {{listeners=PLAINTEXT://hostname:9092 to 
}}{{listeners=PLAINTEXT://localhost:9092}}{{}} 

in /etc/kafka/server.properties file and restarted kafka services. But if we 
use this new value kafka connector service is not able to start because it is 
not able to find the services with the IP of the host. Could you please let us 
know what can we do? We already created the connector several times using 
different names

 

https://stackoverflow.com/questions/35788697/leader-not-available-kafka-in-console-producer

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [kafka] junrao commented on a change in pull request #10914: [KAFKA-8522] Streamline tombstone and transaction marker removal

2021-09-03 Thread GitBox



junrao commented on a change in pull request #10914:
URL: https://github.com/apache/kafka/pull/10914#discussion_r702018820



##
File path: core/src/main/scala/kafka/log/LogCleanerManager.scala
##
@@ -198,8 +199,23 @@ private[log] class LogCleanerManager(val logDirs: 
Seq[File],
   val cleanableLogs = dirtyLogs.filter { ltc =>
 (ltc.needCompactionNow && ltc.cleanableBytes > 0) || 
ltc.cleanableRatio > ltc.log.config.minCleanableRatio
   }
+
   if(cleanableLogs.isEmpty) {
-None
+val logsWithTombstonesExpired = dirtyLogs.filter {
+  case ltc => 
+// in this case, we are probably in a low throughput situation
+// therefore, we should take advantage of this fact and remove 
tombstones if we can
+// under the condition that the log's latest delete horizon is 
less than the current time
+// tracked
+ltc.log.latestDeleteHorizon != RecordBatch.NO_TIMESTAMP && 
ltc.log.latestDeleteHorizon <= time.milliseconds()

Review comment:
   Yes, ideally, we want to do size based estimate. I just not sure how 
accurate we can estimate size given batching and compression.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] hachikuji commented on a change in pull request #11288: MINOR: Fix error response generation

2021-09-03 Thread GitBox



hachikuji commented on a change in pull request #11288:
URL: https://github.com/apache/kafka/pull/11288#discussion_r702039368



##
File path: 
clients/src/test/java/org/apache/kafka/common/requests/RequestResponseTest.java
##
@@ -706,6 +706,11 @@ private void checkDescribeConfigsResponseVersions() {
 private void checkErrorResponse(AbstractRequest req, Throwable e, boolean 
checkEqualityAndHashCode) {
 AbstractResponse response = req.getErrorResponse(e);
 checkResponse(response, req.version(), checkEqualityAndHashCode);
+Errors error = Errors.forException(e);
+Map errorCounts = response.errorCounts();
+assertEquals(1, errorCounts.size());

Review comment:
   nit: could probably simplify these two assertions a little:
   ```java
   assertEquals(Collections.singleton(error), errorCounts.keySet());
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Created] (KAFKA-13272) KStream offset stuck with exactly_once after brokers outage

2021-09-03 Thread Jira

F Méthot created KAFKA-13272:


 Summary: KStream offset stuck with exactly_once after brokers 
outage
 Key: KAFKA-13272
 URL: https://issues.apache.org/jira/browse/KAFKA-13272
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 2.8.0
 Environment: Kafka running on Kubernetes
centos
Reporter: F Méthot


Our KStream app offset stay stuck with exactly_once after outage.

Running with KStream 2.8, kafka broker 2.8,
3 brokers.

commands topic is 10 partitions (replication 2, min-insync 2)
command-expiry-store-changelog topic is 10 partitions (replication 2, 
min-insync 2)
events topic is 10 partitions (replication 2, min-insync 2)

with this topology

Topologies:

 
{code:java}
Sub-topology: 0
 Source: KSTREAM-SOURCE-00 (topics: [commands])
 --> KSTREAM-TRANSFORM-01
 Processor: KSTREAM-TRANSFORM-01 (stores: [])
 --> KSTREAM-TRANSFORM-02
 <-- KSTREAM-SOURCE-00
 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store])
 --> KSTREAM-SINK-03
 <-- KSTREAM-TRANSFORM-01
 Sink: KSTREAM-SINK-03 (topic: events)
 <-- KSTREAM-TRANSFORM-02
{code}
h3. 
Attempt 1 at reproducing this issue

 

Our stream app runs with processing.guarantee *exactly_once* 

After a Kafka test outage where all 3 brokers pod were deleted at the same time,

Brokers restarted and initialized succesfuly.

When restarting the topology above, one of the tasks would never initialize 
fully, the restore phase would keep outputting this messages every few minutes:

 
{code:java}
2021-08-16 14:20:33,421 INFO stream-thread 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
Restoration in progress for 1 partitions. 
{commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, 
totalRestored=2002076} 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
(org.apache.kafka.streams.processor.internals.StoreChangelogReader)
{code}
Task for partition 8 would never initialize, no more data would be read from 
the source commands topic for that partition.

 

In an attempt to recover, we restarted the stream app with stream 
processing.guarantee back to at_least_once, than it proceed with reading the 
changelog and restoring partition 8 fully.

But we noticed afterward, for the next hour until we rebuilt the system, that 
partition 8 from command-expiry-store-changelog would not be cleaned/compacted 
by the log cleaner/compacter compared to other partitions. (could be unrelated, 
because we have seen that before)

So we resorted to delete/recreate our command-expiry-store-changelog topic and 
events topic and regenerate it from the commands, reading from beginning. 

Things went back to normal
h3. Attempt 2 at reproducing this issue

We force-deleted all 3 pod running kafka.
After that, one of the partition can’t be restored. (like reported in previous 
attempt)
For that partition, we noticed these logs on the broker
{code:java}
[2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, 
command-expiry-store-changelog-9) while trying to send transaction markers for 
commands-processor-0_9, these partitions are likely deleted already and hence 
can be skipped 
(kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
Then

- we stop the kstream app,

- restarted kafka brokers cleanly

- Restarting the Kstream app, 

Those logs messages showed up on the kstream app log:

 
{code:java}
2021-08-27 18:34:42,413 INFO [Consumer 
clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
 groupId=commands-processor] The following partitions still have unstable 
offsets which are not cleared on the broker side: [commands-9], this could be 
either transactional offsets waiting for completion, or normal offsets waiting 
for replication after appending to local log 
[commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
 
{code}
This would cause our processor to not consume from that specific source 
topic-partition.
 Deleting downstream topic and replaying data would NOT fix the issue 
(EXACTLY_ONCE or AT_LEAST_ONCE)

Workaround found:

Deleted the group associated with the processor, and restarted the kstream 
application, application went on to process data normally. (We have resigned to 
use AT_LEAST_ONCE for now )

KStream config :
StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000
StreamsConfig.REPLICATION_FACTOR_CONFIG, 2
StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000
StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest”
StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now 
AT_LEAST_ONCE)
producer.d

[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage when exactly_once enabled

2021-09-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

F Méthot updated KAFKA-13272:
-
Summary: KStream offset stuck after brokers outage when exactly_once 
enabled  (was: KStream offset stuck with exactly_once enabled after brokers 
outage)

> KStream offset stuck after brokers outage when exactly_once enabled
> ---
>
> Key: KAFKA-13272
> URL: https://issues.apache.org/jira/browse/KAFKA-13272
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0
> Environment: Kafka running on Kubernetes
> centos
>Reporter: F Méthot
>Priority: Major
>
> Our KStream app offset stay stuck with exactly_once after outage.
> Running with KStream 2.8, kafka broker 2.8,
> 3 brokers.
> commands topic is 10 partitions (replication 2, min-insync 2)
> command-expiry-store-changelog topic is 10 partitions (replication 2, 
> min-insync 2)
> events topic is 10 partitions (replication 2, min-insync 2)
> with this topology
> Topologies:
>  
> {code:java}
> Sub-topology: 0
>  Source: KSTREAM-SOURCE-00 (topics: [commands])
>  --> KSTREAM-TRANSFORM-01
>  Processor: KSTREAM-TRANSFORM-01 (stores: [])
>  --> KSTREAM-TRANSFORM-02
>  <-- KSTREAM-SOURCE-00
>  Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store])
>  --> KSTREAM-SINK-03
>  <-- KSTREAM-TRANSFORM-01
>  Sink: KSTREAM-SINK-03 (topic: events)
>  <-- KSTREAM-TRANSFORM-02
> {code}
> h3. 
> Attempt 1 at reproducing this issue
>  
> Our stream app runs with processing.guarantee *exactly_once* 
> After a Kafka test outage where all 3 brokers pod were deleted at the same 
> time,
> Brokers restarted and initialized succesfuly.
> When restarting the topology above, one of the tasks would never initialize 
> fully, the restore phase would keep outputting this messages every few 
> minutes:
>  
> {code:java}
> 2021-08-16 14:20:33,421 INFO stream-thread 
> [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
> Restoration in progress for 1 partitions. 
> {commands-processor-expiry-store-changelog-8: position=11775908, 
> end=11775911, totalRestored=2002076} 
> [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
> (org.apache.kafka.streams.processor.internals.StoreChangelogReader)
> {code}
> Task for partition 8 would never initialize, no more data would be read from 
> the source commands topic for that partition.
>  
> In an attempt to recover, we restarted the stream app with stream 
> processing.guarantee back to at_least_once, than it proceed with reading the 
> changelog and restoring partition 8 fully.
> But we noticed afterward, for the next hour until we rebuilt the system, that 
> partition 8 from command-expiry-store-changelog would not be 
> cleaned/compacted by the log cleaner/compacter compared to other partitions. 
> (could be unrelated, because we have seen that before)
> So we resorted to delete/recreate our command-expiry-store-changelog topic 
> and events topic and regenerate it from the commands, reading from beginning. 
> Things went back to normal
> h3. Attempt 2 at reproducing this issue
> We force-deleted all 3 pod running kafka.
> After that, one of the partition can’t be restored. (like reported in 
> previous attempt)
> For that partition, we noticed these logs on the broker
> {code:java}
> [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
> Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, 
> command-expiry-store-changelog-9) while trying to send transaction markers 
> for commands-processor-0_9, these partitions are likely deleted already and 
> hence can be skipped 
> (kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
> Then
> - we stop the kstream app,
> - restarted kafka brokers cleanly
> - Restarting the Kstream app, 
> Those logs messages showed up on the kstream app log:
>  
> {code:java}
> 2021-08-27 18:34:42,413 INFO [Consumer 
> clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
>  groupId=commands-processor] The following partitions still have unstable 
> offsets which are not cleared on the broker side: [commands-9], this could be 
> either transactional offsets waiting for completion, or normal offsets 
> waiting for replication after appending to local log 
> [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
>  
> {code}
> This would cause our processor to not consume from that specific source 
> topic-partition.
>  Deleting downstream topic and replaying data would NOT fix the issue 
> (EXACTLY_ONCE or AT_LEAST_ONCE)
> Workaround found:
>

[jira] [Updated] (KAFKA-13272) KStream offset stuck with exactly_once enabled after brokers outage

2021-09-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

F Méthot updated KAFKA-13272:
-
Summary: KStream offset stuck with exactly_once enabled after brokers 
outage  (was: KStream offset stuck with exactly_once after brokers outage)

> KStream offset stuck with exactly_once enabled after brokers outage
> ---
>
> Key: KAFKA-13272
> URL: https://issues.apache.org/jira/browse/KAFKA-13272
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0
> Environment: Kafka running on Kubernetes
> centos
>Reporter: F Méthot
>Priority: Major
>
> Our KStream app offset stay stuck with exactly_once after outage.
> Running with KStream 2.8, kafka broker 2.8,
> 3 brokers.
> commands topic is 10 partitions (replication 2, min-insync 2)
> command-expiry-store-changelog topic is 10 partitions (replication 2, 
> min-insync 2)
> events topic is 10 partitions (replication 2, min-insync 2)
> with this topology
> Topologies:
>  
> {code:java}
> Sub-topology: 0
>  Source: KSTREAM-SOURCE-00 (topics: [commands])
>  --> KSTREAM-TRANSFORM-01
>  Processor: KSTREAM-TRANSFORM-01 (stores: [])
>  --> KSTREAM-TRANSFORM-02
>  <-- KSTREAM-SOURCE-00
>  Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store])
>  --> KSTREAM-SINK-03
>  <-- KSTREAM-TRANSFORM-01
>  Sink: KSTREAM-SINK-03 (topic: events)
>  <-- KSTREAM-TRANSFORM-02
> {code}
> h3. 
> Attempt 1 at reproducing this issue
>  
> Our stream app runs with processing.guarantee *exactly_once* 
> After a Kafka test outage where all 3 brokers pod were deleted at the same 
> time,
> Brokers restarted and initialized succesfuly.
> When restarting the topology above, one of the tasks would never initialize 
> fully, the restore phase would keep outputting this messages every few 
> minutes:
>  
> {code:java}
> 2021-08-16 14:20:33,421 INFO stream-thread 
> [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
> Restoration in progress for 1 partitions. 
> {commands-processor-expiry-store-changelog-8: position=11775908, 
> end=11775911, totalRestored=2002076} 
> [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
> (org.apache.kafka.streams.processor.internals.StoreChangelogReader)
> {code}
> Task for partition 8 would never initialize, no more data would be read from 
> the source commands topic for that partition.
>  
> In an attempt to recover, we restarted the stream app with stream 
> processing.guarantee back to at_least_once, than it proceed with reading the 
> changelog and restoring partition 8 fully.
> But we noticed afterward, for the next hour until we rebuilt the system, that 
> partition 8 from command-expiry-store-changelog would not be 
> cleaned/compacted by the log cleaner/compacter compared to other partitions. 
> (could be unrelated, because we have seen that before)
> So we resorted to delete/recreate our command-expiry-store-changelog topic 
> and events topic and regenerate it from the commands, reading from beginning. 
> Things went back to normal
> h3. Attempt 2 at reproducing this issue
> We force-deleted all 3 pod running kafka.
> After that, one of the partition can’t be restored. (like reported in 
> previous attempt)
> For that partition, we noticed these logs on the broker
> {code:java}
> [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
> Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, 
> command-expiry-store-changelog-9) while trying to send transaction markers 
> for commands-processor-0_9, these partitions are likely deleted already and 
> hence can be skipped 
> (kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
> Then
> - we stop the kstream app,
> - restarted kafka brokers cleanly
> - Restarting the Kstream app, 
> Those logs messages showed up on the kstream app log:
>  
> {code:java}
> 2021-08-27 18:34:42,413 INFO [Consumer 
> clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
>  groupId=commands-processor] The following partitions still have unstable 
> offsets which are not cleared on the broker side: [commands-9], this could be 
> either transactional offsets waiting for completion, or normal offsets 
> waiting for replication after appending to local log 
> [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
>  
> {code}
> This would cause our processor to not consume from that specific source 
> topic-partition.
>  Deleting downstream topic and replaying data would NOT fix the issue 
> (EXACTLY_ONCE or AT_LEAST_ONCE)
> Workaround found:
> Deleted

[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage when exactly_once enabled

2021-09-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

F Méthot updated KAFKA-13272:
-
Description: 
Our KStream app offset stay stuck on 1 partition after outage when exactly_once 
.

Running with KStream 2.8, kafka broker 2.8,
 3 brokers.

commands topic is 10 partitions (replication 2, min-insync 2)
 command-expiry-store-changelog topic is 10 partitions (replication 2, 
min-insync 2)
 events topic is 10 partitions (replication 2, min-insync 2)

with this topology

Topologies:

 
{code:java}
Sub-topology: 0
 Source: KSTREAM-SOURCE-00 (topics: [commands])
 --> KSTREAM-TRANSFORM-01
 Processor: KSTREAM-TRANSFORM-01 (stores: [])
 --> KSTREAM-TRANSFORM-02
 <-- KSTREAM-SOURCE-00
 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store])
 --> KSTREAM-SINK-03
 <-- KSTREAM-TRANSFORM-01
 Sink: KSTREAM-SINK-03 (topic: events)
 <-- KSTREAM-TRANSFORM-02
{code}
h3.  

Attempt 1 at reproducing this issue

 

Our stream app runs with processing.guarantee *exactly_once* 

After a Kafka test outage where all 3 brokers pod were deleted at the same time,

Brokers restarted and initialized succesfuly.

When restarting the topology above, one of the tasks would never initialize 
fully, the restore phase would keep outputting this messages every few minutes:

 
{code:java}
2021-08-16 14:20:33,421 INFO stream-thread 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
Restoration in progress for 1 partitions. 
{commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, 
totalRestored=2002076} 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
(org.apache.kafka.streams.processor.internals.StoreChangelogReader)
{code}
Task for partition 8 would never initialize, no more data would be read from 
the source commands topic for that partition.

 

In an attempt to recover, we restarted the stream app with stream 
processing.guarantee back to at_least_once, than it proceed with reading the 
changelog and restoring partition 8 fully.

But we noticed afterward, for the next hour until we rebuilt the system, that 
partition 8 from command-expiry-store-changelog would not be cleaned/compacted 
by the log cleaner/compacter compared to other partitions. (could be unrelated, 
because we have seen that before)

So we resorted to delete/recreate our command-expiry-store-changelog topic and 
events topic and regenerate it from the commands, reading from beginning.

Things went back to normal
h3. Attempt 2 at reproducing this issue

We force-deleted all 3 pod running kafka.
 After that, one of the partition can’t be restored. (like reported in previous 
attempt)
 For that partition, we noticed these logs on the broker
{code:java}
[2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, 
command-expiry-store-changelog-9) while trying to send transaction markers for 
commands-processor-0_9, these partitions are likely deleted already and hence 
can be skipped 
(kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
Then
 - we stop the kstream app,

 - restarted kafka brokers cleanly

 - Restarting the Kstream app, 

Those logs messages showed up on the kstream app log:

 
{code:java}
2021-08-27 18:34:42,413 INFO [Consumer 
clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
 groupId=commands-processor] The following partitions still have unstable 
offsets which are not cleared on the broker side: [commands-9], this could be 
either transactional offsets waiting for completion, or normal offsets waiting 
for replication after appending to local log 
[commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
 
{code}
This would cause our processor to not consume from that specific source 
topic-partition.
  Deleting downstream topic and replaying data would NOT fix the issue 
(EXACTLY_ONCE or AT_LEAST_ONCE)

Workaround found:

Deleted the group associated with the processor, and restarted the kstream 
application, application went on to process data normally. (We have resigned to 
use AT_LEAST_ONCE for now )

KStream config :
 StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000
 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2
 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000
 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB
 ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest”
 StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now 
AT_LEAST_ONCE)
 producer.delivery.timeout.ms=12
 consumer.session.timeout.ms=3
 consumer.heartbeat.interval.ms=1
 consumer.max.poll.interval.ms=30
 num.stream.threads=1

 

We will be doing more of test and I will update the

[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage when exactly_once enabled

2021-09-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

F Méthot updated KAFKA-13272:
-
Description: 
Our KStream app offset stay stuck on 1 partition after outage when exactly_once 
.

Running with KStream 2.8, kafka broker 2.8,
 3 brokers.

commands topic is 10 partitions (replication 2, min-insync 2)
 command-expiry-store-changelog topic is 10 partitions (replication 2, 
min-insync 2)
 events topic is 10 partitions (replication 2, min-insync 2)

with this topology

Topologies:

 
{code:java}
Sub-topology: 0
 Source: KSTREAM-SOURCE-00 (topics: [commands])
 --> KSTREAM-TRANSFORM-01
 Processor: KSTREAM-TRANSFORM-01 (stores: [])
 --> KSTREAM-TRANSFORM-02
 <-- KSTREAM-SOURCE-00
 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store])
 --> KSTREAM-SINK-03
 <-- KSTREAM-TRANSFORM-01
 Sink: KSTREAM-SINK-03 (topic: events)
 <-- KSTREAM-TRANSFORM-02
{code}
h3.  
h3. Attempt 1 at reproducing this issue

 

Our stream app runs with processing.guarantee *exactly_once* 

After a Kafka test outage where all 3 brokers pod were deleted at the same time,

Brokers restarted and initialized succesfuly.

When restarting the topology above, one of the tasks would never initialize 
fully, the restore phase would keep outputting this messages every few minutes:

 
{code:java}
2021-08-16 14:20:33,421 INFO stream-thread 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
Restoration in progress for 1 partitions. 
{commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, 
totalRestored=2002076} 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
(org.apache.kafka.streams.processor.internals.StoreChangelogReader)
{code}
Task for partition 8 would never initialize, no more data would be read from 
the source commands topic for that partition.

 

In an attempt to recover, we restarted the stream app with stream 
processing.guarantee back to at_least_once, than it proceed with reading the 
changelog and restoring partition 8 fully.

But we noticed afterward, for the next hour until we rebuilt the system, that 
partition 8 from command-expiry-store-changelog would not be cleaned/compacted 
by the log cleaner/compacter compared to other partitions. (could be unrelated, 
because we have seen that before)

So we resorted to delete/recreate our command-expiry-store-changelog topic and 
events topic and regenerate it from the commands, reading from beginning.

Things went back to normal
h3. Attempt 2 at reproducing this issue

We force-deleted all 3 pod running kafka.
 After that, one of the partition can’t be restored. (like reported in previous 
attempt)
 For that partition, we noticed these logs on the broker
{code:java}
[2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, 
command-expiry-store-changelog-9) while trying to send transaction markers for 
commands-processor-0_9, these partitions are likely deleted already and hence 
can be skipped 
(kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
Then
 - we stop the kstream app,

 - restarted kafka brokers cleanly

 - Restarting the Kstream app, 

Those logs messages showed up on the kstream app log:

 
{code:java}
2021-08-27 18:34:42,413 INFO [Consumer 
clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
 groupId=commands-processor] The following partitions still have unstable 
offsets which are not cleared on the broker side: [commands-9], this could be 
either transactional offsets waiting for completion, or normal offsets waiting 
for replication after appending to local log 
[commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
 
{code}
This would cause our processor to not consume from that specific source 
topic-partition.
  Deleting downstream topic and replaying data would NOT fix the issue 
(EXACTLY_ONCE or AT_LEAST_ONCE)

Workaround found:

Deleted the group associated with the processor, and restarted the kstream 
application, application went on to process data normally. (We have resigned to 
use AT_LEAST_ONCE for now )

KStream config :
 StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000
 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2
 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000
 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB
 ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest”
 StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now 
AT_LEAST_ONCE)
 producer.delivery.timeout.ms=12
 consumer.session.timeout.ms=3
 consumer.heartbeat.interval.ms=1
 consumer.max.poll.interval.ms=30
 num.stream.threads=1

 

We will be doing more of test and I will update

[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage when exactly_once enabled

2021-09-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

F Méthot updated KAFKA-13272:
-
Description: 
Our KStream app offset stay stuck on 1 partition after outage when exactly_once 
.

Running with KStream 2.8, kafka broker 2.8,
 3 brokers.

commands topic is 10 partitions (replication 2, min-insync 2)
 command-expiry-store-changelog topic is 10 partitions (replication 2, 
min-insync 2)
 events topic is 10 partitions (replication 2, min-insync 2)

with this topology

Topologies:

 
{code:java}
Sub-topology: 0
 Source: KSTREAM-SOURCE-00 (topics: [commands])
 --> KSTREAM-TRANSFORM-01
 Processor: KSTREAM-TRANSFORM-01 (stores: [])
 --> KSTREAM-TRANSFORM-02
 <-- KSTREAM-SOURCE-00
 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store])
 --> KSTREAM-SINK-03
 <-- KSTREAM-TRANSFORM-01
 Sink: KSTREAM-SINK-03 (topic: events)
 <-- KSTREAM-TRANSFORM-02
{code}
h3.  
h3. Attempt 1 at reproducing this issue

 

Our stream app runs with processing.guarantee *exactly_once* 

After a Kafka test outage where all 3 brokers pod were deleted at the same time,

Brokers restarted and initialized succesfuly.

When restarting the topology above, one of the tasks would never initialize 
fully, the restore phase would keep outputting this messages every few minutes:

 
{code:java}
2021-08-16 14:20:33,421 INFO stream-thread 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
Restoration in progress for 1 partitions. 
{commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, 
totalRestored=2002076} 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
(org.apache.kafka.streams.processor.internals.StoreChangelogReader)
{code}
Task for partition 8 would never initialize, no more data would be read from 
the source commands topic for that partition.

 

In an attempt to recover, we restarted the stream app with stream 
processing.guarantee back to at_least_once, than it proceed with reading the 
changelog and restoring partition 8 fully.

But we noticed afterward, for the next hour until we rebuilt the system, that 
partition 8 from command-expiry-store-changelog would not be cleaned/compacted 
by the log cleaner/compacter compared to other partitions. (could be unrelated, 
because we have seen that before)

So we resorted to delete/recreate our command-expiry-store-changelog topic and 
events topic and regenerate it from the commands, reading from beginning.

Things went back to normal
h3. Attempt 2 at reproducing this issue

We force-deleted all 3 pod running kafka.
 After that, one of the partition can’t be restored. (like reported in previous 
attempt)
 For that partition, we noticed these logs on the broker
{code:java}
[2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, 
command-expiry-store-changelog-9) while trying to send transaction markers for 
commands-processor-0_9, these partitions are likely deleted already and hence 
can be skipped 
(kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
Then
 - we stop the kstream app,

 - restarted kafka brokers cleanly

 - Restarting the Kstream app, 

Those logs messages showed up on the kstream app log:

 
{code:java}
2021-08-27 18:34:42,413 INFO [Consumer 
clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
 groupId=commands-processor] The following partitions still have unstable 
offsets which are not cleared on the broker side: [commands-9], this could be 
either transactional offsets waiting for completion, or normal offsets waiting 
for replication after appending to local log 
[commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
 
{code}
This would cause our processor to not consume from that specific source 
topic-partition.
  Deleting downstream topic and replaying data would NOT fix the issue 
(EXACTLY_ONCE or AT_LEAST_ONCE)

Workaround found:

Deleted the group associated with the processor, and restarted the kstream 
application, application went on to process data normally. (We have resigned to 
use AT_LEAST_ONCE for now )

KStream config :
{code:java}
StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000
 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2
 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000
 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB
 ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest”
 StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now 
AT_LEAST_ONCE)
 producer.delivery.timeout.ms=12
 consumer.session.timeout.ms=3
 consumer.heartbeat.interval.ms=1
 consumer.max.poll.interval.ms=30
 num.stream.threads=1{code}
 

We will be doing more of test an

[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage when exactly_once enabled

2021-09-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

F Méthot updated KAFKA-13272:
-
Description: 
Our KStream app offset stay stuck on 1 partition after outage when exactly_once 
.

Running with KStream 2.8, kafka broker 2.8,
 3 brokers.

commands topic is 10 partitions (replication 2, min-insync 2)
 command-expiry-store-changelog topic is 10 partitions (replication 2, 
min-insync 2)
 events topic is 10 partitions (replication 2, min-insync 2)

with this topology

Topologies:

 
{code:java}
Sub-topology: 0
 Source: KSTREAM-SOURCE-00 (topics: [commands])
 --> KSTREAM-TRANSFORM-01
 Processor: KSTREAM-TRANSFORM-01 (stores: [])
 --> KSTREAM-TRANSFORM-02
 <-- KSTREAM-SOURCE-00
 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store])
 --> KSTREAM-SINK-03
 <-- KSTREAM-TRANSFORM-01
 Sink: KSTREAM-SINK-03 (topic: events)
 <-- KSTREAM-TRANSFORM-02
{code}
h3.  
h3. Attempt 1 at reproducing this issue

 

Our stream app runs with processing.guarantee *exactly_once* 

After a Kafka test outage where all 3 brokers pod were deleted at the same time,

Brokers restarted and initialized succesfuly.

When restarting the topology above, one of the tasks would never initialize 
fully, the restore phase would keep outputting this messages every few minutes:

 
{code:java}
2021-08-16 14:20:33,421 INFO stream-thread 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
Restoration in progress for 1 partitions. 
{commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, 
totalRestored=2002076} 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
(org.apache.kafka.streams.processor.internals.StoreChangelogReader)
{code}
Task for partition 8 would never initialize, no more data would be read from 
the source commands topic for that partition.

 

In an attempt to recover, we restarted the stream app with stream 
processing.guarantee back to at_least_once, than it proceed with reading the 
changelog and restoring partition 8 fully.

But we noticed afterward, for the next hour until we rebuilt the system, that 
partition 8 from command-expiry-store-changelog would not be cleaned/compacted 
by the log cleaner/compacter compared to other partitions. (could be unrelated, 
because we have seen that before)

So we resorted to delete/recreate our command-expiry-store-changelog topic and 
events topic and regenerate it from the commands, reading from beginning.

Things went back to normal
h3. Attempt 2 at reproducing this issue

We force-deleted all 3 pod running kafka.
 After that, one of the partition can’t be restored. (like reported in previous 
attempt)
 For that partition, we noticed these logs on the broker
{code:java}
[2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, 
command-expiry-store-changelog-9) while trying to send transaction markers for 
commands-processor-0_9, these partitions are likely deleted already and hence 
can be skipped 
(kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
Then
 - we stop the kstream app,

 - restarted kafka brokers cleanly

 - Restarting the Kstream app, 

Those logs messages showed up on the kstream app log:

 
{code:java}
2021-08-27 18:34:42,413 INFO [Consumer 
clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
 groupId=commands-processor] The following partitions still have unstable 
offsets which are not cleared on the broker side: [commands-9], this could be 
either transactional offsets waiting for completion, or normal offsets waiting 
for replication after appending to local log 
[commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
 
{code}
This would cause our processor to not consume from that specific source 
topic-partition.
  Deleting downstream topic and replaying data would NOT fix the issue 
(EXACTLY_ONCE or AT_LEAST_ONCE)

Workaround found:

Deleted the group associated with the processor, and restarted the kstream 
application, application went on to process data normally. (We have resigned to 
use AT_LEAST_ONCE for now )

KStream config :
{code:java}
StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000
 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2
 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000
 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB
 ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest”
 StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now 
AT_LEAST_ONCE)
 producer.delivery.timeout.ms=12
 consumer.session.timeout.ms=3
 consumer.heartbeat.interval.ms=1
 consumer.max.poll.interval.ms=30
 num.stream.threads=1{code}
 

We will be doing more tests and

[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage when exactly_once enabled

2021-09-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

F Méthot updated KAFKA-13272:
-
Description: 
Our KStream app offset stay stuck on 1 partition after outage when exactly_once 
is enabled.

Running with KStream 2.8, kafka broker 2.8,
 3 brokers.

commands topic is 10 partitions (replication 2, min-insync 2)
 command-expiry-store-changelog topic is 10 partitions (replication 2, 
min-insync 2)
 events topic is 10 partitions (replication 2, min-insync 2)

with this topology

Topologies:

 
{code:java}
Sub-topology: 0
 Source: KSTREAM-SOURCE-00 (topics: [commands])
 --> KSTREAM-TRANSFORM-01
 Processor: KSTREAM-TRANSFORM-01 (stores: [])
 --> KSTREAM-TRANSFORM-02
 <-- KSTREAM-SOURCE-00
 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store])
 --> KSTREAM-SINK-03
 <-- KSTREAM-TRANSFORM-01
 Sink: KSTREAM-SINK-03 (topic: events)
 <-- KSTREAM-TRANSFORM-02
{code}
h3.  
h3. Attempt 1 at reproducing this issue

 

Our stream app runs with processing.guarantee *exactly_once* 

After a Kafka test outage where all 3 brokers pod were deleted at the same time,

Brokers restarted and initialized succesfuly.

When restarting the topology above, one of the tasks would never initialize 
fully, the restore phase would keep outputting this messages every few minutes:

 
{code:java}
2021-08-16 14:20:33,421 INFO stream-thread 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
Restoration in progress for 1 partitions. 
{commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, 
totalRestored=2002076} 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
(org.apache.kafka.streams.processor.internals.StoreChangelogReader)
{code}
Task for partition 8 would never initialize, no more data would be read from 
the source commands topic for that partition.

 

In an attempt to recover, we restarted the stream app with stream 
processing.guarantee back to at_least_once, than it proceed with reading the 
changelog and restoring partition 8 fully.

But we noticed afterward, for the next hour until we rebuilt the system, that 
partition 8 from command-expiry-store-changelog would not be cleaned/compacted 
by the log cleaner/compacter compared to other partitions. (could be unrelated, 
because we have seen that before)

So we resorted to delete/recreate our command-expiry-store-changelog topic and 
events topic and regenerate it from the commands, reading from beginning.

Things went back to normal
h3. Attempt 2 at reproducing this issue

We force-deleted all 3 pod running kafka.
 After that, one of the partition can’t be restored. (like reported in previous 
attempt)
 For that partition, we noticed these logs on the broker
{code:java}
[2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, 
command-expiry-store-changelog-9) while trying to send transaction markers for 
commands-processor-0_9, these partitions are likely deleted already and hence 
can be skipped 
(kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
Then
 - we stop the kstream app,

 - restarted kafka brokers cleanly

 - Restarting the Kstream app, 

Those logs messages showed up on the kstream app log:

 
{code:java}
2021-08-27 18:34:42,413 INFO [Consumer 
clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
 groupId=commands-processor] The following partitions still have unstable 
offsets which are not cleared on the broker side: [commands-9], this could be 
either transactional offsets waiting for completion, or normal offsets waiting 
for replication after appending to local log 
[commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
 
{code}
This would cause our processor to not consume from that specific source 
topic-partition.
  Deleting downstream topic and replaying data would NOT fix the issue 
(EXACTLY_ONCE or AT_LEAST_ONCE)

Workaround found:

Deleted the group associated with the processor, and restarted the kstream 
application, application went on to process data normally. (We have resigned to 
use AT_LEAST_ONCE for now )

KStream config :
{code:java}
StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000
 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2
 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000
 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB
 ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest”
 StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now 
AT_LEAST_ONCE)
 producer.delivery.timeout.ms=12
 consumer.session.timeout.ms=3
 consumer.heartbeat.interval.ms=1
 consumer.max.poll.interval.ms=30
 num.stream.threads=1{code}
 

We will be doing more

[GitHub] [kafka] vamossagar12 commented on a change in pull request #11211: KAFKA-12960: Enforcing strict retention time for WindowStore and Sess…

2021-09-03 Thread GitBox



vamossagar12 commented on a change in pull request #11211:
URL: https://github.com/apache/kafka/pull/11211#discussion_r702070313



##
File path: 
streams/src/main/java/org/apache/kafka/streams/state/internals/MeteredWindowStore.java
##
@@ -292,13 +408,46 @@ public V fetch(final K key,
 time);
 }
 
+private long getActualWindowStartTime(final long timeFrom) {
+return Math.max(timeFrom, ((PersistentWindowStore) 
wrapped()).getObservedStreamTime() - retentionPeriod + 1);
+}
+
+private KeyValueIterator, V> filterExpiredRecords(final 
boolean forward) {
+final KeyValueIterator, byte[]> 
allWindowedKeyValueIterator = forward ? wrapped().all() : 
wrapped().backwardAll();
+
+final long observedStreamTime = ((PersistentWindowStore) wrapped()).getObservedStreamTime();
+if (!allWindowedKeyValueIterator.hasNext() || observedStreamTime == 
ConsumerRecord.NO_TIMESTAMP)
+return new 
MeteredWindowedKeyValueIterator<>(allWindowedKeyValueIterator, fetchSensor, 
streamsMetrics, serdes, time);
+
+final long windowStartBoundary = observedStreamTime - retentionPeriod 
+ 1;
+final List, byte[]>> 
windowedKeyValuesInBoundary = new ArrayList<>();
+
+while (allWindowedKeyValueIterator.hasNext()) {
+final KeyValue, byte[]> next = 
allWindowedKeyValueIterator.next();
+if 
(next.key.window().endTime().isBefore(Instant.ofEpochMilli(windowStartBoundary)))
 {
+continue;
+}
+windowedKeyValuesInBoundary.add(next);
+}
+return new MeteredWindowedKeyValueIterator<>(new 
WindowedKeyValueIterator(windowedKeyValuesInBoundary.iterator()), fetchSensor, 
streamsMetrics, serdes, time);
+}

Review comment:
   @mjsax i have a question here... In the jira ticket, you. have mentioned 
that the best place for adding this filtering is in the MeteredStore as that 
implicitly adds the logic even for custom state stores. While for the most 
part, this kind of filtering has worked fine(fetching relevant records and then 
filtering in MeteredStore) but there's a case where it's failing. It's for 
test. cases like `shouldNotThrowConcurrentModificationException` . This seems 
to be because the put() call while iterating is appending to the wrapped 
instance of iterator and hence it's not visible.
   
   Looking at this, do you think it would be a good idea to move this logic in 
the actual RocksDB implementations? Or do you think there's a better way to do 
it here in MeteredStore class itself?
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Updated] (KAFKA-12634) Should checkpoint after restore finished

2021-09-03 Thread Guozhang Wang (Jira)



 [ 
https://issues.apache.org/jira/browse/KAFKA-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-12634:
--
Labels: newbie++  (was: )

> Should checkpoint after restore finished
> 
>
> Key: KAFKA-12634
> URL: https://issues.apache.org/jira/browse/KAFKA-12634
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.5.0
>Reporter: Matthias J. Sax
>Priority: Major
>  Labels: newbie++
>
> For state stores, Kafka Streams maintains local checkpoint files to track the 
> offsets of the state store changelog topics. The checkpoint is updated on 
> commit or when a task is closed cleanly.
> However, after a successful restore, the checkpoint is not written. Thus, if 
> an instance crashes after restore but before committing, even if the state is 
> on local disk the checkpoint file is missing (indicating that there is no 
> state) and thus state would be restored from scratch.
> While for most cases, the time between restore end and next commit is small, 
> there are cases when this time could be large, for example if there is no new 
> input data to be processed (if there is no input data, the commit would be 
> skipped).
> Thus, we should write the checkpoint file after a successful restore to close 
> this gap (or course, only for at-least-once processing).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (KAFKA-13243) Differentiate metric latency measured in millis and nanos

2021-09-03 Thread Guozhang Wang (Jira)



[ 
https://issues.apache.org/jira/browse/KAFKA-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409653#comment-17409653
 ] 

Guozhang Wang commented on KAFKA-13243:
---

Thanks [~josep.prat] Could you also send an email for the KIP voting? Since it 
is a rather small KIP I think we do not need a DISCUSS thread.

> Differentiate metric latency measured in millis and nanos
> -
>
> Key: KAFKA-13243
> URL: https://issues.apache.org/jira/browse/KAFKA-13243
> Project: Kafka
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Guozhang Wang
>Assignee: Josep Prat
>Priority: Major
>  Labels: needs-kip
>
> Today most of the client latency metrics are measured in millis, and some in 
> nanos. For those measured in nanos we usually differentiate them by having a 
> `-ns` suffix in the metric names, e.g. `io-wait-time-ns-avg` and 
> `io-time-ns-avg`. But there are a few that we obviously forgot to follow this 
> pattern, e.g. `io-wait-time-total`: it is inconsistent where `avg` has `-ns` 
> suffix and `total` has not. I did a quick search and found just three of them:
> * bufferpool-wait-time-total -> bufferpool-wait-time-ns-total
> * io-wait-time-total -> io-wait-time-ns-total
> * iotime-total -> io-time-ns-total (note that there are two inconsistencies 
> on naming, the average metric is `io-time-ns-avg` whereas total is 
> `iotime-total`, I suggest we use `io-time` instead of `iotime` for both).
> We should change their name accordingly with the `-ns` suffix as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (KAFKA-13243) Differentiate metric latency measured in millis and nanos

2021-09-03 Thread Josep Prat (Jira)



[ 
https://issues.apache.org/jira/browse/KAFKA-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409657#comment-17409657
 ] 

Josep Prat commented on KAFKA-13243:


Sent right now

> Differentiate metric latency measured in millis and nanos
> -
>
> Key: KAFKA-13243
> URL: https://issues.apache.org/jira/browse/KAFKA-13243
> Project: Kafka
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Guozhang Wang
>Assignee: Josep Prat
>Priority: Major
>  Labels: needs-kip
>
> Today most of the client latency metrics are measured in millis, and some in 
> nanos. For those measured in nanos we usually differentiate them by having a 
> `-ns` suffix in the metric names, e.g. `io-wait-time-ns-avg` and 
> `io-time-ns-avg`. But there are a few that we obviously forgot to follow this 
> pattern, e.g. `io-wait-time-total`: it is inconsistent where `avg` has `-ns` 
> suffix and `total` has not. I did a quick search and found just three of them:
> * bufferpool-wait-time-total -> bufferpool-wait-time-ns-total
> * io-wait-time-total -> io-wait-time-ns-total
> * iotime-total -> io-time-ns-total (note that there are two inconsistencies 
> on naming, the average metric is `io-time-ns-avg` whereas total is 
> `iotime-total`, I suggest we use `io-time` instead of `iotime` for both).
> We should change their name accordingly with the `-ns` suffix as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [kafka] dajac merged pull request #11086: KAFKA-13103: add REBALANCE_IN_PROGRESS error as retriable error for AlterConsumerGroupOffsetsHandler

2021-09-03 Thread GitBox



dajac merged pull request #11086:
URL: https://github.com/apache/kafka/pull/11086


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage

2021-09-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

F Méthot updated KAFKA-13272:
-
Summary: KStream offset stuck after brokers outage  (was: KStream offset 
stuck after brokers outage when exactly_once enabled)

> KStream offset stuck after brokers outage
> -
>
> Key: KAFKA-13272
> URL: https://issues.apache.org/jira/browse/KAFKA-13272
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0
> Environment: Kafka running on Kubernetes
> centos
>Reporter: F Méthot
>Priority: Major
>
> Our KStream app offset stay stuck on 1 partition after outage when 
> exactly_once is enabled.
> Running with KStream 2.8, kafka broker 2.8,
>  3 brokers.
> commands topic is 10 partitions (replication 2, min-insync 2)
>  command-expiry-store-changelog topic is 10 partitions (replication 2, 
> min-insync 2)
>  events topic is 10 partitions (replication 2, min-insync 2)
> with this topology
> Topologies:
>  
> {code:java}
> Sub-topology: 0
>  Source: KSTREAM-SOURCE-00 (topics: [commands])
>  --> KSTREAM-TRANSFORM-01
>  Processor: KSTREAM-TRANSFORM-01 (stores: [])
>  --> KSTREAM-TRANSFORM-02
>  <-- KSTREAM-SOURCE-00
>  Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store])
>  --> KSTREAM-SINK-03
>  <-- KSTREAM-TRANSFORM-01
>  Sink: KSTREAM-SINK-03 (topic: events)
>  <-- KSTREAM-TRANSFORM-02
> {code}
> h3.  
> h3. Attempt 1 at reproducing this issue
>  
> Our stream app runs with processing.guarantee *exactly_once* 
> After a Kafka test outage where all 3 brokers pod were deleted at the same 
> time,
> Brokers restarted and initialized succesfuly.
> When restarting the topology above, one of the tasks would never initialize 
> fully, the restore phase would keep outputting this messages every few 
> minutes:
>  
> {code:java}
> 2021-08-16 14:20:33,421 INFO stream-thread 
> [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
> Restoration in progress for 1 partitions. 
> {commands-processor-expiry-store-changelog-8: position=11775908, 
> end=11775911, totalRestored=2002076} 
> [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
> (org.apache.kafka.streams.processor.internals.StoreChangelogReader)
> {code}
> Task for partition 8 would never initialize, no more data would be read from 
> the source commands topic for that partition.
>  
> In an attempt to recover, we restarted the stream app with stream 
> processing.guarantee back to at_least_once, than it proceed with reading the 
> changelog and restoring partition 8 fully.
> But we noticed afterward, for the next hour until we rebuilt the system, that 
> partition 8 from command-expiry-store-changelog would not be 
> cleaned/compacted by the log cleaner/compacter compared to other partitions. 
> (could be unrelated, because we have seen that before)
> So we resorted to delete/recreate our command-expiry-store-changelog topic 
> and events topic and regenerate it from the commands, reading from beginning.
> Things went back to normal
> h3. Attempt 2 at reproducing this issue
> We force-deleted all 3 pod running kafka.
>  After that, one of the partition can’t be restored. (like reported in 
> previous attempt)
>  For that partition, we noticed these logs on the broker
> {code:java}
> [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
> Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, 
> command-expiry-store-changelog-9) while trying to send transaction markers 
> for commands-processor-0_9, these partitions are likely deleted already and 
> hence can be skipped 
> (kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
> Then
>  - we stop the kstream app,
>  - restarted kafka brokers cleanly
>  - Restarting the Kstream app, 
> Those logs messages showed up on the kstream app log:
>  
> {code:java}
> 2021-08-27 18:34:42,413 INFO [Consumer 
> clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
>  groupId=commands-processor] The following partitions still have unstable 
> offsets which are not cleared on the broker side: [commands-9], this could be 
> either transactional offsets waiting for completion, or normal offsets 
> waiting for replication after appending to local log 
> [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
>  
> {code}
> This would cause our processor to not consume from that specific source 
> topic-partition.
>   Deleting downstream topic and replaying data would NOT fix the issue 
> (EXACTLY_ONCE or AT_LEAST_ONCE)
> Workaround found:
> Deleted the group associated with th

[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage

2021-09-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

F Méthot updated KAFKA-13272:
-
Description: 
Our KStream app offset stay stuck on 1 partition after outage possibly when 
exactly_once is enabled.

Running with KStream 2.8, kafka broker 2.8,
 3 brokers.

commands topic is 10 partitions (replication 2, min-insync 2)
 command-expiry-store-changelog topic is 10 partitions (replication 2, 
min-insync 2)
 events topic is 10 partitions (replication 2, min-insync 2)

with this topology

Topologies:

 
{code:java}
Sub-topology: 0
 Source: KSTREAM-SOURCE-00 (topics: [commands])
 --> KSTREAM-TRANSFORM-01
 Processor: KSTREAM-TRANSFORM-01 (stores: [])
 --> KSTREAM-TRANSFORM-02
 <-- KSTREAM-SOURCE-00
 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store])
 --> KSTREAM-SINK-03
 <-- KSTREAM-TRANSFORM-01
 Sink: KSTREAM-SINK-03 (topic: events)
 <-- KSTREAM-TRANSFORM-02
{code}
h3.  
h3. Attempt 1 at reproducing this issue

 

Our stream app runs with processing.guarantee *exactly_once* 

After a Kafka test outage where all 3 brokers pod were deleted at the same time,

Brokers restarted and initialized succesfuly.

When restarting the topology above, one of the tasks would never initialize 
fully, the restore phase would keep outputting this messages every few minutes:

 
{code:java}
2021-08-16 14:20:33,421 INFO stream-thread 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
Restoration in progress for 1 partitions. 
{commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, 
totalRestored=2002076} 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
(org.apache.kafka.streams.processor.internals.StoreChangelogReader)
{code}
Task for partition 8 would never initialize, no more data would be read from 
the source commands topic for that partition.

 

In an attempt to recover, we restarted the stream app with stream 
processing.guarantee back to at_least_once, than it proceed with reading the 
changelog and restoring partition 8 fully.

But we noticed afterward, for the next hour until we rebuilt the system, that 
partition 8 from command-expiry-store-changelog would not be cleaned/compacted 
by the log cleaner/compacter compared to other partitions. (could be unrelated, 
because we have seen that before)

So we resorted to delete/recreate our command-expiry-store-changelog topic and 
events topic and regenerate it from the commands, reading from beginning.

Things went back to normal
h3. Attempt 2 at reproducing this issue

We force-deleted all 3 pod running kafka.
 After that, one of the partition can’t be restored. (like reported in previous 
attempt)
 For that partition, we noticed these logs on the broker
{code:java}
[2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, 
command-expiry-store-changelog-9) while trying to send transaction markers for 
commands-processor-0_9, these partitions are likely deleted already and hence 
can be skipped 
(kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
Then
 - we stop the kstream app,

 - restarted kafka brokers cleanly

 - Restarting the Kstream app, 

Those logs messages showed up on the kstream app log:

 
{code:java}
2021-08-27 18:34:42,413 INFO [Consumer 
clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
 groupId=commands-processor] The following partitions still have unstable 
offsets which are not cleared on the broker side: [commands-9], this could be 
either transactional offsets waiting for completion, or normal offsets waiting 
for replication after appending to local log 
[commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
 
{code}
This would cause our processor to not consume from that specific source 
topic-partition.
  Deleting downstream topic and replaying data would NOT fix the issue 
(EXACTLY_ONCE or AT_LEAST_ONCE)

Workaround found:

Deleted the group associated with the processor, and restarted the kstream 
application, application went on to process data normally. (We have resigned to 
use AT_LEAST_ONCE for now )

KStream config :
{code:java}
StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000
 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2
 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000
 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB
 ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest”
 StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now 
AT_LEAST_ONCE)
 producer.delivery.timeout.ms=12
 consumer.session.timeout.ms=3
 consumer.heartbeat.interval.ms=1
 consumer.max.poll.interval.ms=30
 num.stream.threads=1{code}
 

We will be do

[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage

2021-09-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

F Méthot updated KAFKA-13272:
-
Description: 
Our KStream app offset stay stuck on 1 partition after outage possibly when 
exactly_once is enabled.

Running with KStream 2.8, kafka broker 2.8,
 3 brokers.

commands topic is 10 partitions (replication 2, min-insync 2)
 command-expiry-store-changelog topic is 10 partitions (replication 2, 
min-insync 2)
 events topic is 10 partitions (replication 2, min-insync 2)

with this topology

Topologies:

 
{code:java}
Sub-topology: 0
 Source: KSTREAM-SOURCE-00 (topics: [commands])
 --> KSTREAM-TRANSFORM-01
 Processor: KSTREAM-TRANSFORM-01 (stores: [])
 --> KSTREAM-TRANSFORM-02
 <-- KSTREAM-SOURCE-00
 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store])
 --> KSTREAM-SINK-03
 <-- KSTREAM-TRANSFORM-01
 Sink: KSTREAM-SINK-03 (topic: events)
 <-- KSTREAM-TRANSFORM-02
{code}
h3.  
h3. Attempt 1 at reproducing this issue

 

Our stream app runs with processing.guarantee *exactly_once* 

After a Kafka test outage where all 3 brokers pod were deleted at the same time,

Brokers restarted and initialized succesfuly.

When restarting the topology above, one of the tasks would never initialize 
fully, the restore phase would keep outputting this messages every few minutes:

 
{code:java}
2021-08-16 14:20:33,421 INFO stream-thread 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
Restoration in progress for 1 partitions. 
{commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, 
totalRestored=2002076} 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
(org.apache.kafka.streams.processor.internals.StoreChangelogReader)
{code}
Task for partition 8 would never initialize, no more data would be read from 
the source commands topic for that partition.

 

In an attempt to recover, we restarted the stream app with stream 
processing.guarantee back to at_least_once, than it proceed with reading the 
changelog and restoring partition 8 fully.

But we noticed afterward, for the next hour until we rebuilt the system, that 
partition 8 from command-expiry-store-changelog would not be cleaned/compacted 
by the log cleaner/compacter compared to other partitions. (could be unrelated, 
because we have seen that before)

So we resorted to delete/recreate our command-expiry-store-changelog topic and 
events topic and regenerate it from the commands, reading from beginning.

Things went back to normal
h3. Attempt 2 at reproducing this issue

kstream runs with *exactly-once*

We force-deleted all 3 pod running kafka.
 After that, one of the partition can’t be restored. (like reported in previous 
attempt)
 For that partition, we noticed these logs on the broker
{code:java}
[2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, 
command-expiry-store-changelog-9) while trying to send transaction markers for 
commands-processor-0_9, these partitions are likely deleted already and hence 
can be skipped 
(kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
Then
 - we stop the kstream app,

 - restarted kafka brokers cleanly

 - Restarting the Kstream app, 

Those logs messages showed up on the kstream app log:

 
{code:java}
2021-08-27 18:34:42,413 INFO [Consumer 
clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
 groupId=commands-processor] The following partitions still have unstable 
offsets which are not cleared on the broker side: [commands-9], this could be 
either transactional offsets waiting for completion, or normal offsets waiting 
for replication after appending to local log 
[commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
 
{code}
This would cause our processor to not consume from that specific source 
topic-partition.
  Deleting downstream topic and replaying data would NOT fix the issue 
(EXACTLY_ONCE or AT_LEAST_ONCE)

Workaround found:

Deleted the group associated with the processor, and restarted the kstream 
application, application went on to process data normally. (We have resigned to 
use AT_LEAST_ONCE for now )

KStream config :
{code:java}
StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000
 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2
 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000
 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB
 ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest”
 StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now 
AT_LEAST_ONCE)
 producer.delivery.timeout.ms=12
 consumer.session.timeout.ms=3
 consumer.heartbeat.interval.ms=1
 consumer.max.poll.interval.ms=30
 num.strea

[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage

2021-09-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

F Méthot updated KAFKA-13272:
-
Description: 
Our KStream app offset stay stuck on 1 partition after outage possibly when 
exactly_once is enabled.

Running with KStream 2.8, kafka broker 2.8,
 3 brokers.

commands topic is 10 partitions (replication 2, min-insync 2)
 command-expiry-store-changelog topic is 10 partitions (replication 2, 
min-insync 2)
 events topic is 10 partitions (replication 2, min-insync 2)

with this topology

Topologies:

 
{code:java}
Sub-topology: 0
 Source: KSTREAM-SOURCE-00 (topics: [commands])
 --> KSTREAM-TRANSFORM-01
 Processor: KSTREAM-TRANSFORM-01 (stores: [])
 --> KSTREAM-TRANSFORM-02
 <-- KSTREAM-SOURCE-00
 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store])
 --> KSTREAM-SINK-03
 <-- KSTREAM-TRANSFORM-01
 Sink: KSTREAM-SINK-03 (topic: events)
 <-- KSTREAM-TRANSFORM-02
{code}
h3.  
h3. Attempt 1 at reproducing this issue

 

Our stream app runs with processing.guarantee *exactly_once* 

After a Kafka test outage where all 3 brokers pod were deleted at the same time,

Brokers restarted and initialized succesfuly.

When restarting the topology above, one of the tasks would never initialize 
fully, the restore phase would keep outputting this messages every few minutes:

 
{code:java}
2021-08-16 14:20:33,421 INFO stream-thread 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
Restoration in progress for 1 partitions. 
{commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, 
totalRestored=2002076} 
[commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
(org.apache.kafka.streams.processor.internals.StoreChangelogReader)
{code}
Task for partition 8 would never initialize, no more data would be read from 
the source commands topic for that partition.

 

In an attempt to recover, we restarted the stream app with stream 
processing.guarantee back to at_least_once, than it proceed with reading the 
changelog and restoring partition 8 fully.

But we noticed afterward, for the next hour until we rebuilt the system, that 
partition 8 from command-expiry-store-changelog would not be cleaned/compacted 
by the log cleaner/compacter compared to other partitions. (could be unrelated, 
because we have seen that before)

So we resorted to delete/recreate our command-expiry-store-changelog topic and 
events topic and regenerate it from the commands, reading from beginning.

Things went back to normal
h3. Attempt 2 at reproducing this issue

kstream runs with *exactly-once*

We force-deleted all 3 pod running kafka.
 After that, one of the partition can’t be restored. (like reported in previous 
attempt)
 For that partition, we noticed these logs on the broker
{code:java}
[2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, 
command-expiry-store-changelog-9) while trying to send transaction markers for 
commands-processor-0_9, these partitions are likely deleted already and hence 
can be skipped 
(kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
Then
 - we stop the kstream app,

 - restarted kafka brokers cleanly

 - Restarting the Kstream app, 

Those logs messages showed up on the kstream app log:

 
{code:java}
2021-08-27 18:34:42,413 INFO [Consumer 
clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
 groupId=commands-processor] The following partitions still have unstable 
offsets which are not cleared on the broker side: [commands-9], this could be 
either transactional offsets waiting for completion, or normal offsets waiting 
for replication after appending to local log 
[commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
 
{code}
This would cause our processor to not consume from that specific source 
topic-partition.
  Deleting downstream topic and replaying data would NOT fix the issue 
(EXACTLY_ONCE or AT_LEAST_ONCE)

Workaround found:

Deleted the group associated with the processor, and restarted the kstream 
application, application went on to process data normally. (We have resigned to 
use AT_LEAST_ONCE for now )

KStream config :
{code:java}
StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000
 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2
 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000
 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB
 ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest”
 StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now 
AT_LEAST_ONCE)
 producer.delivery.timeout.ms=12
 consumer.session.timeout.ms=3
 consumer.heartbeat.interval.ms=1
 consumer.max.poll.interval.ms=30
 num.strea

[GitHub] [kafka] ijuma opened a new pull request #11295: KAFKA-13270: Set JUTE_MAXBUFFER to 4 MB by default

2021-09-03 Thread GitBox



ijuma opened a new pull request #11295:
URL: https://github.com/apache/kafka/pull/11295


   ZooKeeper 3.6.0 changed the default configuration for JUTE_MAXBUFFER from 4 
MB to 1 MB.
   This causes a regression if Kafka tries to retrieve a large amount of data 
across many
   znodes – in such a case the ZooKeeper client will repeatedly emit a message 
of the form
   "java.io.IOException: Packet len <> is out of range".
   
   We restore the 3.4.x/3.5.x behavior unless the caller has set the property 
(note that ZKConfig
   auto configures itself if certain system properties have been set).
   
   I added a unit test that fails without the change and passes with it.

   See https://github.com/apache/zookeeper/pull/1129 for the details on why the 
behavior
   changed in 3.6.0.
   
   Credit to @rondagostino for finding and reporting this issue.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Assigned] (KAFKA-13206) shutting down broker needs to stop fetching as a follower in KRaft mode

2021-09-03 Thread HaiyuanZhao (Jira)



 [ 
https://issues.apache.org/jira/browse/KAFKA-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HaiyuanZhao reassigned KAFKA-13206:
---

Assignee: HaiyuanZhao

> shutting down broker needs to stop fetching as a follower in KRaft mode
> ---
>
> Key: KAFKA-13206
> URL: https://issues.apache.org/jira/browse/KAFKA-13206
> Project: Kafka
>  Issue Type: Bug
>  Components: core, kraft, replication
>Affects Versions: 3.0.0
>Reporter: Jun Rao
>Assignee: HaiyuanZhao
>Priority: Major
>  Labels: kip-500
>
> In the ZK mode, the controller will send a stopReplica(with deletion flag as 
> false) request to the shutting down broker so that it will stop the followers 
> from fetching. In KRaft mode, we don't have a corresponding logic. This means 
> unnecessary rejected fetch follower requests during controlled shutdown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (KAFKA-10859) add @Test annotation to FileStreamSourceTaskTest.testInvalidFile and reduce the loop count to speedup the test

2021-09-03 Thread Borzoo Esmailloo (Jira)



[ 
https://issues.apache.org/jira/browse/KAFKA-10859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409847#comment-17409847
 ] 

Borzoo Esmailloo commented on KAFKA-10859:
--

Hey [~chia7712], could you please take a look at my PR? :)

> add @Test annotation to FileStreamSourceTaskTest.testInvalidFile and reduce 
> the loop count to speedup the test
> --
>
> Key: KAFKA-10859
> URL: https://issues.apache.org/jira/browse/KAFKA-10859
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Chia-Ping Tsai
>Assignee: Tom Bentley
>Priority: Major
>  Labels: newbie
>
> FileStreamSourceTaskTest.testInvalidFile miss a `@Test` annotation. Also, it 
> loops 100 times which spend about 2m to complete a unit test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (KAFKA-13261) KTable to KTable foreign key join loose events when using several partitions

2021-09-03 Thread Tomas Forsman (Jira)



[ 
https://issues.apache.org/jira/browse/KAFKA-13261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409883#comment-17409883
 ] 

Tomas Forsman commented on KAFKA-13261:
---

Hi [~guozhang], the local one is full-fledged, however scaled down, setup 
running the actual services needed to reproduce the problem. We do have test 
environment and staging environment with complete setup where we also can 
reproduce the problem.

Thanks [~vvcephei], i'll look into setting up a test case based on the one you 
sent me.

About removing the partitioned, the example I've given is the minimal one 
needed to show the problem. Our typologies are bigger in general and we where 
hoping to be able to use several partitions instead of just one. However I will 
follow your suggestion and try without it.

 

 

 

> KTable to KTable foreign key join loose events when using several partitions
> 
>
> Key: KAFKA-13261
> URL: https://issues.apache.org/jira/browse/KAFKA-13261
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Tomas Forsman
>Priority: Major
> Attachments: KafkaTest.java
>
>
> Two incoming streams A and B. 
> Stream A uses a composite key [a, b]
> Stream B has key [b]
> Stream B has 4 partitions and steams A has 1 partition.
> What we try to do is repartition stream A to have 4 partitions too, then put 
> both A and B into KTable and do a foreign key join on from A to B
> When doing this, all messages does not end up in the output topic.
> Repartitioning both to only use 1 partition each solve the problem so it seem 
> like it has something to do with the foreign key join in combination with 
> several partitions. 
> One suspicion would be that it is not possible to define what partitioner to 
> use for the join.
> Any insight or help is greatly appreciated.
> *Example code of the problem*
> {code:java}
> static Topology createTopoology(){
> var builder = new StreamsBuilder();
> KTable tableB = builder.table("B",  
> stringMaterialized("table.b"));
> builder
> .stream("A", Consumed.with(Serde.of(KeyA.class), 
> Serde.of(EventA.class)))
> .repartition(repartitionTopicA())
> .toTable(Named.as("table.a"), aMaterialized("table.a"))
> .join(tableB, EventA::getKeyB, topicAandBeJoiner(), 
> Named.as("join.ab"), joinMaterialized("join.ab"))
> .toStream()
> .to("output", with(...));
> return builder.build();
> }
> private static Materialized aMaterialized(String name) {
>   Materialized> table = 
> Materialized.as(name);
>   return 
> table.withKeySerde(Serde.of(KeyA.class)).withValueSerde(Serde.of(EventA.class));
> }
> private static Repartitioned repartitionTopicA() {
> Repartitioned repartitioned = 
> Repartitioned.as("driverperiod");
> return 
> repartitioned.withKeySerde(Serde.of(KeyA.class)).withValueSerde(Serde.of(EventA.class))
> .withStreamPartitioner(topicAPartitioner())
> .withNumberOfPartitions(4);
> }
> private static StreamPartitioner 
> topicAPartitioner() {
> return (topic, key, value, numPartitions) -> 
> Math.abs(key.getKeyB().hashCode()) % numPartitions;
> }
> private static Materialized> 
> joinMaterialized(String name) {
> Materialized> 
> table = Materialized.as(name);
> return 
> table.withKeySerde(Serde.of(KeyA.class)).withValueSerde(Serde.of(EventA.class));
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

37 matches

Mail list logo