[jira] [Resolved] (KAFKA-9553) Transaction state loading metric does not count total loading time

2020-03-19 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-9553.

Fix Version/s: 2.6.0
   Resolution: Fixed

> Transaction state loading metric does not count total loading time
> --
>
> Key: KAFKA-9553
> URL: https://issues.apache.org/jira/browse/KAFKA-9553
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Agam Brahma
>Priority: Major
> Fix For: 2.6.0
>
>
> KIP-484 added a metric to track total loading time for internal topics: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-484%3A+Expose+metrics+for+group+and+transaction+metadata+loading+duration.
>  The value is updated incorrectly in TransactionStateManager: rather than 
> recording the total loading time once, it records the elapsed time 
> separately after every read from the log.
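The difference between the buggy and intended recording can be sketched in a toy example (Python purely for illustration; `Histogram` and the two loader functions are hypothetical stand-ins, not Kafka's metrics API):

```python
import time

class Histogram:
    """Toy metric sink standing in for the loading-time sensor."""
    def __init__(self):
        self.samples = []

    def record(self, value):
        self.samples.append(value)

def load_per_read(reads, metric):
    # Bug pattern described above: a sample is recorded after every
    # read from the log, so the metric reflects per-read durations.
    for do_read in reads:
        start = time.monotonic()
        do_read()
        metric.record(time.monotonic() - start)

def load_total(reads, metric):
    # Intended behavior: time the whole loading pass and record once.
    start = time.monotonic()
    for do_read in reads:
        do_read()
    metric.record(time.monotonic() - start)
```

With four log reads, the buggy variant emits four small samples while the fixed variant emits a single total-duration sample.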



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9553) Transaction state loading metric does not count total loading time

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063117#comment-17063117
 ] 

ASF GitHub Bot commented on KAFKA-9553:
---

hachikuji commented on pull request #8155: KAFKA-9553: Improve measurement for 
loading groups and transactions
URL: https://github.com/apache/kafka/pull/8155
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Transaction state loading metric does not count total loading time
> --
>
> Key: KAFKA-9553
> URL: https://issues.apache.org/jira/browse/KAFKA-9553
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Agam Brahma
>Priority: Major
>
> KIP-484 added a metric to track total loading time for internal topics: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-484%3A+Expose+metrics+for+group+and+transaction+metadata+loading+duration.
>  The value is updated incorrectly in TransactionStateManager: rather than 
> recording the total loading time once, it records the elapsed time 
> separately after every read from the log.





[jira] [Resolved] (KAFKA-9721) ReassignPartitionsCommand#reassignPartitions doesn't wait for the completion of altering replica folder

2020-03-19 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-9721.
---
Resolution: Won't Fix

ReassignPartitionsCommand does not guarantee that the replica move has 
completed, so this issue is closed as "Won't Fix".

> ReassignPartitionsCommand#reassignPartitions doesn't wait for the completion 
> of altering replica folder
> ---
>
> Key: KAFKA-9721
> URL: https://issues.apache.org/jira/browse/KAFKA-9721
> Project: Kafka
>  Issue Type: Bug
>Reporter: Chia-Ping Tsai
>Assignee: Chia-Ping Tsai
>Priority: Major
>
> {code:scala}
>     val replicasAssignedToFutureDir = mutable.Set.empty[TopicPartitionReplica]
>     while (remainingTimeMs > 0 && replicasAssignedToFutureDir.size < proposedReplicaAssignment.size) {
>       replicasAssignedToFutureDir ++= alterReplicaLogDirsIgnoreReplicaNotAvailable(
>         proposedReplicaAssignment.filter { case (replica, _) => !replicasAssignedToFutureDir.contains(replica) },
>         adminClientOpt.get, remainingTimeMs)
>       Thread.sleep(100)
>       remainingTimeMs = startTimeMs + timeoutMs - System.currentTimeMillis()
>     }
> {code}
> The response of #alterReplicaLogDirs does NOT indicate that the move has 
> completed, since the alter process runs on another thread that moves the data 
> from the source to the target folder. Hence, completion should be determined 
> from the response of #describeLogDirs rather than #alterReplicaLogDirs.
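The polling loop this implies, deriving completion from a describe-style call rather than from the alter response, might look like the following sketch (Python for illustration; `wait_for_replica_moves` and its callable argument are hypothetical stand-ins for AdminClient#describeReplicaLogDirs, not a real API):

```python
import time

def wait_for_replica_moves(describe_log_dirs, replicas, timeout_s, poll_s=0.1):
    """Poll a describeLogDirs-style callable until no replica reports an
    in-flight move to a future log dir.  `describe_log_dirs` takes the
    set of pending replicas and returns the subset still moving."""
    deadline = time.monotonic() + timeout_s
    pending = set(replicas)
    while pending and time.monotonic() < deadline:
        pending = set(describe_log_dirs(pending))
        if pending:
            time.sleep(poll_s)
    return pending  # empty set means every move has completed
```

A caller would treat a non-empty return value as a timeout with the listed replicas still incomplete.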





[jira] [Commented] (KAFKA-8820) kafka-reassign-partitions.sh should support the KIP-455 API

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063057#comment-17063057
 ] 

ASF GitHub Bot commented on KAFKA-8820:
---

cmccabe commented on pull request #8244: KAFKA-8820: 
kafka-reassign-partitions.sh should support the KIP-455 API
URL: https://github.com/apache/kafka/pull/8244
 
 
   
 



> kafka-reassign-partitions.sh should support the KIP-455 API
> ---
>
> Key: KAFKA-8820
> URL: https://issues.apache.org/jira/browse/KAFKA-8820
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Gwen Shapira
>Assignee: Colin McCabe
>Priority: Major
>
> KIP-455 and KAFKA-8345 add a protocol and Admin API that will be used for 
> replica reassignments. We need to update the reassignment tool to use this 
> new API rather than working with ZooKeeper directly.





[jira] [Updated] (KAFKA-9733) Consider addition of leader quorum in replication model

2020-03-19 Thread Richard Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated KAFKA-9733:
--
Description: 
Kafka's current replication model (a single leader with several followers) is 
somewhat similar to the consensus algorithms used in databases (e.g. Raft), 
with the major difference being the existence of the ISR. Consequently, Kafka 
suffers from the same fault-tolerance issue as other distributed systems that 
rely on Raft: the leader tends to be the chokepoint for failures, i.e. if it 
goes down, there is a brief stop-the-world effect. 

In contrast, giving all replicas the power to read from and write to other 
replicas is also difficult to accomplish (as the complexity of the Egalitarian 
Paxos algorithm shows), since consistency is very hard to maintain in such an 
algorithm, and there is very little gain compared to the overhead. 

Therefore, I propose an intermediate plan between these two algorithms: a 
leader replica quorum. In essence, there will be multiple leaders (which can 
serve both reads and writes), but the number of leaders will not be excessive 
(i.e. maybe three at max). How we achieve consistency is simple:
 * Any leader has the power to propose a write update to other replicas. But 
before passing a write update to a follower, the other leaders must vote on 
whether the operation is granted.
 * In principle, a leader will propose a write update to the other leaders, and 
once the other leaders have integrated that write update into their version of 
the stored data, they will also give the green light. 
 * If, say, more than half of the other leaders have agreed that the current 
change is good to go, then we can forward the change downstream to the other 
replicas.

 The algorithm for maintaining consistency between multiple leaders will still 
have to be worked out in detail. However, there would be multiple gains from 
this design over the old model:
 # The single leader failure bottleneck has been alleviated to a certain 
extent, since there are now multiple leader replicas.
 # Write updates will potentially no longer be bottlenecked at one single 
leader (since there are multiple leaders available). On a related note, there 
has been a KIP that allows clients to read from non-leader replicas. (will add 
the KIP link soon).

Some might note that the overhead of maintaining consistency among multiple 
leaders might offset these gains. That might be true with a large number of 
leaders, but with a small number (capped at three as mentioned above), the 
overhead will also be correspondingly small. (How latency will be affected is 
unknown until further testing; more than likely, this option will be 
configurable depending on user requirements.)
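The majority-vote step in the bullets above can be sketched as a toy function (Python for illustration; the vote callables and all names here are hypothetical, not part of any Kafka code or proposal text):

```python
def propose_write(leaders, proposer_idx, update):
    """Toy majority vote among a small leader quorum.  `leaders` is a
    list of vote callables (update -> bool); the proposer's own entry
    is skipped because it implicitly votes for its own update."""
    votes = 1  # the proposer's own vote
    for i, vote in enumerate(leaders):
        if i == proposer_idx:
            continue
        if vote(update):
            votes += 1
    # Forward the change downstream only with a strict majority.
    return votes * 2 > len(leaders)
```

With a three-leader quorum, one additional acceptance is enough to commit, while a proposer alone is not a majority.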

 

 


[jira] [Updated] (KAFKA-9658) Removing default user quota doesn't take effect until broker restart

2020-03-19 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-9658:
---
Fix Version/s: (was: 2.5.1)
   2.5.0

> Removing default user quota doesn't take effect until broker restart
> 
>
> Key: KAFKA-9658
> URL: https://issues.apache.org/jira/browse/KAFKA-9658
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.1.1, 2.2.2, 2.4.0, 2.3.1
>Reporter: Anna Povzner
>Assignee: Anna Povzner
>Priority: Major
> Fix For: 2.5.0, 2.3.2, 2.4.2
>
>
> To reproduce (for any quota type: produce, consume, or request), here is an 
> example with a consumer quota, assuming no user/client quotas are set initially:
> 1. Set the default user consumer quota:
> {{./kafka-configs.sh --zookeeper  --alter --add-config 'consumer_byte_rate=1' --entity-type users --entity-default}}
> 2. Send some consume load for some user, say user1.
> 3. Remove the default user consumer quota using:
> {{./kafka-configs.sh --zookeeper  --alter --delete-config 'consumer_byte_rate' --entity-type users --entity-default}}
> Result: --describe (as below) returns the correct result that there is no 
> quota, but the quota bound in ClientQuotaManager.metrics does not get updated 
> for users that were sending load, which causes the broker to continue 
> throttling requests with the previously set quota.
> {{/opt/confluent/bin/kafka-configs.sh --zookeeper   --describe --entity-type users --entity-default}}
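The stale-bound pattern described in the result can be sketched as a toy cache (Python for illustration; `DefaultQuotaCache` and its methods are hypothetical stand-ins, not the real ClientQuotaManager API):

```python
class DefaultQuotaCache:
    """Per-user throttle bounds are cached at first use, so deleting the
    default quota must also invalidate the cached bounds."""
    def __init__(self, default_bound):
        self.default_bound = default_bound
        self.bounds = {}  # user -> cached bound, populated on first load

    def bound_for(self, user):
        # First request from a user snapshots the current default.
        if user not in self.bounds:
            self.bounds[user] = self.default_bound
        return self.bounds[user]

    def remove_default(self):
        self.default_bound = None
        # The step the bug effectively skipped: without this, users that
        # were already sending load keep their stale cached bound.
        self.bounds.clear()
```

Omitting the `clear()` call reproduces the reported behavior: `--describe` shows no quota while active users are still throttled at the old bound.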





[jira] [Created] (KAFKA-9737) Describing log dir reassignment times out if broker is offline

2020-03-19 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-9737:
--

 Summary: Describing log dir reassignment times out if broker is 
offline
 Key: KAFKA-9737
 URL: https://issues.apache.org/jira/browse/KAFKA-9737
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


If any broker is offline when trying to describe a log dir reassignment, 
we get something like the following error:
{code}
Status of partition reassignment:
Partitions reassignment failed due to org.apache.kafka.common.errors.TimeoutException: Call(callName=describeReplicaLogDirs, deadlineMs=1584663960068, tries=1, nextAllowedTryMs=1584663960173) timed out at 1584663960073 after 1 attempt(s)
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeReplicaLogDirs, deadlineMs=1584663960068, tries=1, nextAllowedTryMs=1584663960173) timed out at 1584663960073 after 1 attempt(s)
	at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
	at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
	at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
	at kafka.admin.ReassignPartitionsCommand$.checkIfReplicaReassignmentSucceeded(ReassignPartitionsCommand.scala:381)
	at kafka.admin.ReassignPartitionsCommand$.verifyAssignment(ReassignPartitionsCommand.scala:98)
	at kafka.admin.ReassignPartitionsCommand$.verifyAssignment(ReassignPartitionsCommand.scala:90)
	at kafka.admin.ReassignPartitionsCommand$.main(ReassignPartitionsCommand.scala:61)
	at kafka.admin.ReassignPartitionsCommand.main(ReassignPartitionsCommand.scala)
Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeReplicaLogDirs, deadlineMs=1584663960068, tries=1, nextAllowedTryMs=1584663960173) timed out at 1584663960073 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
{code}
It would be nice if the tool were smart enough to notice brokers that are 
offline and report them as such, while still reporting the status of 
reassignments for online brokers.
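One way the suggested liveness check could be sketched (Python for illustration; the function and its arguments are hypothetical, not the actual AdminClient API):

```python
def partition_by_liveness(replica_brokers, live_brokers):
    """Split replicas by broker liveness before issuing the describe
    call, so offline brokers can be reported directly instead of the
    whole call timing out.  `replica_brokers` maps a topic-partition
    string to its broker id."""
    online, offline = {}, {}
    for tp, broker in replica_brokers.items():
        target = online if broker in live_brokers else offline
        target[tp] = broker
    # Describe only `online`; report `offline` entries as unreachable.
    return online, offline
```

The tool would then run describeReplicaLogDirs against the online subset and print the offline subset as "broker offline" rather than a generic TimeoutException.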





[jira] [Commented] (KAFKA-9654) ReplicaAlterLogDirsThread can't be created again if the previous ReplicaAlterLogDirsThread encounters a leader epoch error

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063000#comment-17063000
 ] 

ASF GitHub Bot commented on KAFKA-9654:
---

hachikuji commented on pull request #8223: KAFKA-9654 ReplicaAlterLogDirsThread 
can't be created again if the pr…
URL: https://github.com/apache/kafka/pull/8223
 
 
   
 



> ReplicaAlterLogDirsThread can't be created again if the previous 
> ReplicaAlterLogDirsThread encounters a leader epoch error
> 
>
> Key: KAFKA-9654
> URL: https://issues.apache.org/jira/browse/KAFKA-9654
> Project: Kafka
>  Issue Type: Bug
>Reporter: Chia-Ping Tsai
>Assignee: Chia-Ping Tsai
>Priority: Critical
>
> ReplicaManager creates a ReplicaAlterLogDirsThread only if a new future 
> log is created. If the previous ReplicaAlterLogDirsThread encounters an error 
> while moving data, the target partition is moved to "failedPartitions" and the 
> ReplicaAlterLogDirsThread becomes idle due to its empty partition set. The 
> future log still exists, so we CAN'T either create another 
> ReplicaAlterLogDirsThread to handle the partition or update the partitions of 
> the idle ReplicaAlterLogDirsThread.
> ReplicaManager should call ReplicaAlterLogDirsManager#addFetcherForPartitions 
> even if there is already a future log, since we can then create a new 
> ReplicaAlterLogDirsThread to handle the new partitions or update the 
> partitions of the existing ReplicaAlterLogDirsThread to make it busy again.
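The proposed fix, re-registering partitions even when a future log already exists, can be modeled with a toy manager (Python for illustration; the class and method names loosely mirror the description above but are hypothetical, not Kafka's internal API):

```python
class AlterLogDirsManager:
    """Toy model: re-adding a partition clears any earlier failure and
    hands it back to a (new or previously idle) mover thread."""
    def __init__(self):
        self.active = set()   # partitions a mover thread is working on
        self.failed = set()   # partitions parked after an error

    def add_fetcher_for_partitions(self, partitions):
        # Called unconditionally, even when a future log already exists.
        self.failed -= set(partitions)
        self.active |= set(partitions)

    def on_move_error(self, partition):
        # Mirrors the "failedPartitions" parking described above.
        self.active.discard(partition)
        self.failed.add(partition)
```

Without the unconditional re-add, a partition that once failed stays parked forever, which is the stuck state the issue describes.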





[jira] [Assigned] (KAFKA-9225) kafka fail to run on linux-aarch64

2020-03-19 Thread Matthias J. Sax (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax reassigned KAFKA-9225:
--

Assignee: jiamei xie

> kafka fail to run on linux-aarch64
> --
>
> Key: KAFKA-9225
> URL: https://issues.apache.org/jira/browse/KAFKA-9225
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: jiamei xie
>Assignee: jiamei xie
>Priority: Blocker
>  Labels: incompatible
> Attachments: compat_report.html
>
>
> *Steps to reproduce:*
> 1. Download the latest Kafka source code
> 2. Build it with Gradle
> 3. Run [streamDemo|https://kafka.apache.org/23/documentation/streams/quickstart]
> When running bin/kafka-run-class.sh 
> org.apache.kafka.streams.examples.wordcount.WordCountDemo, it crashes with 
> the following error message:
> {code:java}
> xjm@ubuntu-arm01:~/kafka$bin/kafka-run-class.sh 
> org.apache.kafka.streams.examples.wordcount.WordCountDemo
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/xjm/kafka/core/build/dependant-libs-2.12.10/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/xjm/kafka/tools/build/dependant-libs-2.12.10/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/xjm/kafka/connect/api/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/xjm/kafka/connect/transforms/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/xjm/kafka/connect/runtime/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/xjm/kafka/connect/file/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/xjm/kafka/connect/mirror/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/xjm/kafka/connect/mirror-client/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/xjm/kafka/connect/json/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/xjm/kafka/connect/basic-auth-extension/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> [2019-11-19 15:42:23,277] WARN The configuration 'admin.retries' was supplied 
> but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig)
> [2019-11-19 15:42:23,278] WARN The configuration 'admin.retry.backoff.ms' was 
> supplied but isn't a known config. 
> (org.apache.kafka.clients.consumer.ConsumerConfig)
> [2019-11-19 15:42:24,278] ERROR stream-client 
> [streams-wordcount-0f3cf88b-e2c4-4fb6-b7a3-9754fad5cd48] All stream threads 
> have died. The instance will be in error state and should be closed. 
> (org.apache.kafka.streams.KafkaStreams)
> Exception in thread 
> "streams-wordcount-0f3cf88b-e2c4-4fb6-b7a3-9754fad5cd48-StreamThread-1" 
> java.lang.UnsatisfiedLinkError: /tmp/librocksdbjni1377754636857652484.so: 
> /tmp/librocksdbjni1377754636857652484.so: 
> cannot open shared object file: No such file or directory (Possible cause: 
> can't load AMD 64-bit .so on a AARCH64-bit platform)
> {code}
>  *Analysis:*
> This issue is caused by rocksdbjni-5.18.3.jar, which does not ship aarch64 
> native support. Replacing rocksdbjni-5.18.3.jar with rocksdbjni-6.3.6.jar from 
> [https://mvnrepository.com/artifact/org.rocksdb/rocksdbjni/6.3.6] fixes the 
> problem.
> Attached is the binary compatibility report of rocksdbjni.jar between 5.18.3 
> and 6.3.6; the result is 81.8%. Would it be possible to upgrade rocksdbjni to 
> 6.3.6 upstream? If there are any tests I should run, please point me to them. 
> Thanks a lot.





[jira] [Updated] (KAFKA-9225) kafka fail to run on linux-aarch64

2020-03-19 Thread Matthias J. Sax (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-9225:
---
Priority: Major  (was: Blocker)

> kafka fail to run on linux-aarch64
> --
>
> Key: KAFKA-9225
> URL: https://issues.apache.org/jira/browse/KAFKA-9225
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: jiamei xie
>Assignee: jiamei xie
>Priority: Major
>  Labels: incompatible
> Fix For: 2.6.0
>
> Attachments: compat_report.html
>
>





[jira] [Commented] (KAFKA-9225) kafka fail to run on linux-aarch64

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062985#comment-17062985
 ] 

ASF GitHub Bot commented on KAFKA-9225:
---

mjsax commented on pull request #8284: KAFKA-9225: Bump rocksdb 5.18.3 to 5.18.4
URL: https://github.com/apache/kafka/pull/8284
 
 
   
 



> kafka fail to run on linux-aarch64
> --
>
> Key: KAFKA-9225
> URL: https://issues.apache.org/jira/browse/KAFKA-9225
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: jiamei xie
>Priority: Blocker
>  Labels: incompatible
> Attachments: compat_report.html
>
>





[jira] [Commented] (KAFKA-9734) Streams IllegalStateException in trunk during rebalance

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062969#comment-17062969
 ] 

ASF GitHub Bot commented on KAFKA-9734:
---

vvcephei commented on pull request #8319: KAFKA-9734: Fix IllegalState in 
Streams transit to standby
URL: https://github.com/apache/kafka/pull/8319
 
 
   Consolidate ChangelogReader state management inside of StreamThread to avoid 
having to reason about all execution paths in both StreamThread and TaskManager.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 



> Streams IllegalStateException in trunk during rebalance
> ---
>
> Key: KAFKA-9734
> URL: https://issues.apache.org/jira/browse/KAFKA-9734
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: John Roesler
>Assignee: John Roesler
>Priority: Major
>
> I have observed the following exception kill a thread on the current trunk 
> of Streams:
> {noformat}
> [2020-03-19 07:04:35,206] ERROR [stream-soak-test-e60443b4-aa2d-4107-abf7-ce90407cb70e-StreamThread-1] stream-thread [stream-soak-test-e60443b4-aa2d-4107-abf7-ce90407cb70e-StreamThread-1] Encountered the following exception during processing and the thread is going to shut down:  (org.apache.kafka.streams.processor.internals.StreamThread)
> java.lang.IllegalStateException: The changelog reader is not restoring active tasks while trying to transit to update standby tasks: {}
> at org.apache.kafka.streams.processor.internals.StoreChangelogReader.transitToUpdateStandby(StoreChangelogReader.java:303)
> at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:582)
> at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:498)
> at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:472){noformat}
> Judging from the fact that the standby tasks are reported as an empty set, I 
> think this is just a missed case in the task manager. PR to follow shortly.





[jira] [Commented] (KAFKA-7965) Flaky Test ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup

2020-03-19 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062950#comment-17062950
 ] 

Matthias J. Sax commented on KAFKA-7965:


[https://builds.apache.org/job/kafka-pr-jdk11-scala2.13/5290/testReport/junit/kafka.api/ConsumerBounceTest/testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup/]

> Flaky Test 
> ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
> 
>
> Key: KAFKA-7965
> URL: https://issues.apache.org/jira/browse/KAFKA-7965
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer, unit tests
>Affects Versions: 1.1.1, 2.2.0, 2.3.0
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> To get stable nightly builds for `2.2` release, I create tickets for all 
> observed test failures.
> [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/21/]
> {quote}java.lang.AssertionError: Received 0, expected at least 68 at 
> org.junit.Assert.fail(Assert.java:88) at 
> org.junit.Assert.assertTrue(Assert.java:41) at 
> kafka.api.ConsumerBounceTest.receiveAndCommit(ConsumerBounceTest.scala:557) 
> at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1(ConsumerBounceTest.scala:320)
>  at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1$adapted(ConsumerBounceTest.scala:319)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) 
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:319){quote}





[jira] [Commented] (KAFKA-9625) Unable to Describe broker configurations that have been set via IncrementalAlterConfigs

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062924#comment-17062924
 ] 

ASF GitHub Bot commented on KAFKA-9625:
---

cmccabe commented on pull request #8206: KAFKA-9625: Broker throttles are 
incorrectly marked as sensitive configurations
URL: https://github.com/apache/kafka/pull/8206
 
 
   
 



> Unable to Describe broker configurations that have been set via 
> IncrementalAlterConfigs
> ---
>
> Key: KAFKA-9625
> URL: https://issues.apache.org/jira/browse/KAFKA-9625
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Reporter: Colin McCabe
>Assignee: Sanjana Kaundinya
>Priority: Critical
> Fix For: 2.6.0
>
>
> There seem to be at least two bugs in the broker configuration APIs and/or 
> logic:
> 1. Broker throttles are incorrectly marked as sensitive configurations.  This 
> includes leader.replication.throttled.rate, 
> follower.replication.throttled.rate, 
> replica.alter.log.dirs.io.max.bytes.per.second.  This means that their values 
> cannot be read back by DescribeConfigs after they are set.
> 2. When we clear the broker throttles via incrementalAlterConfigs, 
> DescribeConfigs continues returning the old throttles indefinitely.  In other 
> words, the clearing is not reflected in the Describe API.





[jira] [Commented] (KAFKA-9676) Add test coverage for new ActiveTaskCreator and StandbyTaskCreator

2020-03-19 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062919#comment-17062919
 ] 

Matthias J. Sax commented on KAFKA-9676:


[https://github.com/apache/kafka/pull/8318] adds all missing tests for 
`ActiveTaskCreator`.

> Add test coverage for new ActiveTaskCreator and StandbyTaskCreator
> --
>
> Key: KAFKA-9676
> URL: https://issues.apache.org/jira/browse/KAFKA-9676
> Project: Kafka
>  Issue Type: Test
>  Components: streams
>Reporter: Boyang Chen
>Priority: Major
>  Labels: help-wanted, newbie
>
> The newly separated ActiveTaskCreator and StandbyTaskCreator have no unit 
> test coverage. We should add corresponding tests.





[jira] [Commented] (KAFKA-9451) Pass consumer group metadata to producer on commit

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062911#comment-17062911
 ] 

ASF GitHub Bot commented on KAFKA-9451:
---

mjsax commented on pull request #8318: KAFKA-9451: Enable producer per thread 
for Streams EOS
URL: https://github.com/apache/kafka/pull/8318
 
 
   Enabled producer per thread via KIP-447.
   
   - add configs to enable EOS-beta and for a safe upgrade from older versions
   - for eos-beta, create a single shared producer and `StreamsProducer` over 
all tasks
   - when committing, commit all tasks
   
   Call for review @abbccdda @guozhangwang 
 



> Pass consumer group metadata to producer on commit
> --
>
> Key: KAFKA-9451
> URL: https://issues.apache.org/jira/browse/KAFKA-9451
> Project: Kafka
>  Issue Type: Sub-task
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>Priority: Major
>
> Using the producer-per-thread EOS design, we need to pass the consumer group 
> metadata into `producer.sendOffsetsToTransaction()` to use the new consumer 
> group coordinator fencing mechanism. We should also reduce the default 
> transaction timeout to 10 seconds (see the KIP for details).





[jira] [Created] (KAFKA-9736) Use internal consumer config to fail for unsupported protocol

2020-03-19 Thread Matthias J. Sax (Jira)
Matthias J. Sax created KAFKA-9736:
--

 Summary: Use internal consumer config to fail for unsupported 
protocol
 Key: KAFKA-9736
 URL: https://issues.apache.org/jira/browse/KAFKA-9736
 Project: Kafka
  Issue Type: Sub-task
  Components: streams
Reporter: Matthias J. Sax


Via KAFKA-9657, we added an internal config that allows us to fail the consumer 
if the broker does not support fetch offset "blocking".

If eos-beta is enabled (also for upgrade mode), we should enable this flag in 
Kafka Streams.





[jira] [Assigned] (KAFKA-9736) Use internal consumer config to fail for unsupported protocol

2020-03-19 Thread Matthias J. Sax (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax reassigned KAFKA-9736:
--

Assignee: Matthias J. Sax

> Use internal consumer config to fail for unsupported protocol
> -
>
> Key: KAFKA-9736
> URL: https://issues.apache.org/jira/browse/KAFKA-9736
> Project: Kafka
>  Issue Type: Sub-task
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>Priority: Major
>
> Via KAFKA-9657, we added an internal config that allows us to fail the 
> consumer if the broker does not support fetch offset "blocking".
> If eos-beta is enabled (also for upgrade mode), we should enable this flag in 
> Kafka Streams.





[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2020-03-19 Thread Guozhang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062874#comment-17062874
 ] 

Guozhang Wang commented on KAFKA-8803:
--

I guess what I was trying to convey is that it's not 100 percent verified that 
the issue you hit is KAFKA-9144 for sure. There may be other root causes yet to 
be dug out, but still, if you upgrade to 2.4.1 and no longer see this, that can 
also tell us something :)

> Stream will not start due to TimeoutException: Timeout expired after 
> 60000milliseconds while awaiting InitProducerId
> 
>
> Key: KAFKA-8803
> URL: https://issues.apache.org/jira/browse/KAFKA-8803
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Raman Gupta
>Assignee: Sophie Blee-Goldman
>Priority: Major
> Attachments: logs-20200311.txt.gz, logs-client-20200311.txt.gz, 
> logs.txt.gz, screenshot-1.png
>
>
> One streams app is consistently failing at startup with the following 
> exception:
> {code}
> 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] 
> org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout 
> exception caught when initializing transactions for task 0_36. This might 
> happen if the broker is slow to respond, if the network connection to the 
> broker was interrupted, or if similar circumstances arise. You can increase 
> producer parameter `max.block.ms` to increase this timeout.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after 
> 60000milliseconds while awaiting InitProducerId
> {code}
> These same brokers are used by many other streams without any issue, 
> including some in the very same processes for the stream which consistently 
> throws this exception.
> *UPDATE 08/16:*
> The very first instance of this error is August 13th 2019, 17:03:36.754 and 
> it happened for 4 different streams. For 3 of these streams, the error only 
> happened once, and then the stream recovered. For the 4th stream, the error 
> has continued to happen, and continues to happen now.
> I looked up the broker logs for this time, and see that at August 13th 2019, 
> 16:47:43, two of four brokers started reporting messages like this, for 
> multiple partitions:
> [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, 
> fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader 
> reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
> The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, 
> here is a view of the count of these messages over time:
>  !screenshot-1.png! 
> However, as noted, the stream task timeout error continues to happen.
> I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 
> broker. The broker has a patch for KAFKA-8773.
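For reference, producer configs can be overridden in Kafka Streams with the `producer.` prefix, so the `max.block.ms` workaround mentioned in the error message can be applied without touching other clients (the value below is illustrative, not a recommendation):

```properties
# Illustrative only: raise the internal producer's max blocking time
# (default 60000 ms) via Kafka Streams' "producer." config prefix.
producer.max.block.ms=120000
```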





[jira] [Resolved] (KAFKA-9441) Refactor commit logic

2020-03-19 Thread Matthias J. Sax (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax resolved KAFKA-9441.

Fix Version/s: 2.6.0
   Resolution: Fixed

> Refactor commit logic
> -
>
> Key: KAFKA-9441
> URL: https://issues.apache.org/jira/browse/KAFKA-9441
> Project: Kafka
>  Issue Type: Sub-task
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>Priority: Major
> Fix For: 2.6.0
>
>
> With a producer per thread in combination with EOS, it is no longer possible 
> to commit individual tasks independently (as done currently).
> We need to refactor StreamThread to commit all tasks at the same time for 
> the new model.





[jira] [Commented] (KAFKA-9441) Refactor commit logic

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062847#comment-17062847
 ] 

ASF GitHub Bot commented on KAFKA-9441:
---

mjsax commented on pull request #8218: KAFKA-9441: Unify committing within 
TaskManager
URL: https://github.com/apache/kafka/pull/8218
 
 
   
 



> Refactor commit logic
> -
>
> Key: KAFKA-9441
> URL: https://issues.apache.org/jira/browse/KAFKA-9441
> Project: Kafka
>  Issue Type: Sub-task
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>Priority: Major
>
> With a producer per thread in combination with EOS, it is no longer possible 
> to commit individual tasks independently (as done currently).
> We need to refactor StreamThread to commit all tasks at the same time for 
> the new model.





[jira] [Commented] (KAFKA-6145) Warm up new KS instances before migrating tasks - potentially a two phase rebalance

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062838#comment-17062838
 ] 

ASF GitHub Bot commented on KAFKA-6145:
---

vvcephei commented on pull request #8282: KAFKA-6145: add new assignment configs
URL: https://github.com/apache/kafka/pull/8282
 
 
   
 



> Warm up new KS instances before migrating tasks - potentially a two phase 
> rebalance
> ---
>
> Key: KAFKA-6145
> URL: https://issues.apache.org/jira/browse/KAFKA-6145
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Antony Stubbs
>Assignee: Sophie Blee-Goldman
>Priority: Major
>  Labels: needs-kip
>
> Currently when expanding the KS cluster, the new node's partitions will be 
> unavailable during the rebalance, which for large state stores can take a very 
> long time; even a pause of more than a few ms can be a deal breaker for 
> microservice use cases.
> One workaround would be to execute the rebalance in two phases:
> 1) start building the state stores on the new node
> 2) once the state stores are fully populated on the new node, only then 
> rebalance the tasks - there will still be a rebalance pause, but it would be 
> greatly reduced
> Relates to: KAFKA-6144 - Allow state stores to serve stale reads during 
> rebalance
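The two-phase idea above can be sketched with plain collections. Node and task names are invented for illustration; this is not the actual Streams assignor logic:

```java
import java.util.*;

public class TwoPhaseRebalance {
    // Phase 1: the new instance builds state as a warm-up replica while the
    // old owner keeps serving. Phase 2: only fully-restored tasks migrate,
    // so the second rebalance pause is short.
    public static void main(String[] args) {
        Map<String, String> owner = new HashMap<>();
        owner.put("task-0", "node-A");
        owner.put("task-1", "node-A");

        // Phase 1: node-B restores task-1's store in the background;
        // node-A continues processing, so there is no availability gap.
        Set<String> warmedUp = new HashSet<>();
        warmedUp.add("task-1");

        // Phase 2: migrate only tasks whose state is fully restored.
        for (String task : warmedUp) {
            owner.put(task, "node-B");
        }

        System.out.println(owner.get("task-0") + "," + owner.get("task-1"));
    }
}
```

Tasks that never warmed up (task-0 here) simply stay where they are until a later round.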





[jira] [Created] (KAFKA-9735) Kafka Connect REST endpoint doesn't consider keystore/truststore

2020-03-19 Thread SledgeHammer (Jira)
SledgeHammer created KAFKA-9735:
---

 Summary: Kafka Connect REST endpoint doesn't consider 
keystore/truststore
 Key: KAFKA-9735
 URL: https://issues.apache.org/jira/browse/KAFKA-9735
 Project: Kafka
  Issue Type: Bug
  Components: KafkaConnect
Affects Versions: 2.4.0
 Environment: Windows 10 Pro x64
Reporter: SledgeHammer


Trying to secure the Kafka Connect REST endpoint with SSL fails (no cipher 
suites in common):

listeners.https.ssl.client.auth=required
listeners.https.ssl.truststore.location=/progra~1/kafka_2.12-2.4.0/config/kafka.truststore.jks
listeners.https.ssl.truststore.password=xxx
listeners.https.ssl.keystore.location=/progra~1/kafka_2.12-2.4.0/config/kafka01.keystore.jks
listeners.https.ssl.keystore.password=xxx
listeners.https.ssl.key.password=xxx
listeners.https.ssl.enabled.protocols=TLSv1.2

 

See: 
[https://stackoverflow.com/questions/55220602/unable-to-configure-ssl-for-kafka-connect-rest-api]

 

A developer debugged this and saw that the configs are not being returned. I am 
still seeing this issue in 2.4.0.





[jira] [Commented] (KAFKA-9724) Consumer wrongly ignores fetched records "since it no longer has valid position"

2020-03-19 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062787#comment-17062787
 ] 

Jason Gustafson commented on KAFKA-9724:


[~o.muravskiy] Thanks for the report. Could you include your consumer 
configuration?

> Consumer wrongly ignores fetched records "since it no longer has valid 
> position"
> 
>
> Key: KAFKA-9724
> URL: https://issues.apache.org/jira/browse/KAFKA-9724
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Affects Versions: 2.4.0
>Reporter: Oleg Muravskiy
>Priority: Major
>
> After upgrading kafka-client to 2.4.0 (while brokers are still at 2.2.0) 
> consumers in a consumer group intermittently stop progressing on assigned 
> partitions, even when there are messages to consume. This is not a permanent 
> condition, as they progress from time to time, but become slower with time, 
> and catch up after restart.
> Here is a sample of 3 consecutive ignored fetches:
> {noformat}
> 2020-03-15 12:08:58,440 DEBUG [Thread-6] o.a.k.c.c.i.ConsumerCoordinator - 
> Committed offset 538065584 for partition mrt-rrc10-6
> 2020-03-15 12:08:58,541 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Skipping validation of fetch offsets for partitions [mrt-rrc10-1, 
> mrt-rrc10-6, mrt-rrc22-7] since the broker does not support the required 
> protocol version (introduced in Kafka 2.3)
> 2020-03-15 12:08:58,549 DEBUG [Thread-6] org.apache.kafka.clients.Metadata - 
> Updating last seen epoch from null to 62 for partition mrt-rrc10-6
> 2020-03-15 12:08:58,557 DEBUG [Thread-6] o.a.k.c.c.i.ConsumerCoordinator - 
> Committed offset 538065584 for partition mrt-rrc10-6
> 2020-03-15 12:08:58,652 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Skipping validation of fetch offsets for partitions [mrt-rrc10-1, 
> mrt-rrc10-6, mrt-rrc22-7] since the broker does not support the required 
> protocol version (introduced in Kafka 2.3)
> 2020-03-15 12:08:58,659 DEBUG [Thread-6] org.apache.kafka.clients.Metadata - 
> Updating last seen epoch from null to 62 for partition mrt-rrc10-6
> 2020-03-15 12:08:58,659 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Fetch READ_UNCOMMITTED at offset 538065584 for partition mrt-rrc10-6 returned 
> fetch data (error=NONE, highWaterMark=538065631, lastStableOffset = 
> 538065631, logStartOffset = 485284547, preferredReadReplica = absent, 
> abortedTransactions = null, recordsSizeInBytes=16380)
> 2020-03-15 12:08:58,659 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Ignoring fetched records for partition mrt-rrc10-6 since it no longer has 
> valid position
> 2020-03-15 12:08:58,665 DEBUG [Thread-6] o.a.k.c.c.i.ConsumerCoordinator - 
> Committed offset 538065584 for partition mrt-rrc10-6
> 2020-03-15 12:08:58,761 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Skipping validation of fetch offsets for partitions [mrt-rrc10-1, 
> mrt-rrc10-6, mrt-rrc22-7] since the broker does not support the required 
> protocol version (introduced in Kafka 2.3)
> 2020-03-15 12:08:58,761 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Added READ_UNCOMMITTED fetch request for partition mrt-rrc10-6 at position 
> FetchPosition{offset=538065584, offsetEpoch=Optional[62], 
> currentLeader=LeaderAndEpoch{leader=node03.kafka:9092 (id: 3 rack: null), 
> epoch=-1}} to node node03.kafka:9092 (id: 3 rack: null)
> 2020-03-15 12:08:58,761 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), 
> implied=(mrt-rrc10-6, mrt-rrc22-7, mrt-rrc10-1)) to broker node03.kafka:9092 
> (id: 3 rack: null)
> 2020-03-15 12:08:58,770 DEBUG [Thread-6] org.apache.kafka.clients.Metadata - 
> Updating last seen epoch from null to 62 for partition mrt-rrc10-6
> 2020-03-15 12:08:58,770 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Fetch READ_UNCOMMITTED at offset 538065584 for partition mrt-rrc10-6 returned 
> fetch data (error=NONE, highWaterMark=538065727, lastStableOffset = 
> 538065727, logStartOffset = 485284547, preferredReadReplica = absent, 
> abortedTransactions = null, recordsSizeInBytes=51864)
> 2020-03-15 12:08:58,770 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Ignoring fetched records for partition mrt-rrc10-6 since it no longer has 
> valid position
> 2020-03-15 12:08:58,808 DEBUG [Thread-6] o.a.k.c.c.i.ConsumerCoordinator - 
> Committed offset 538065584 for partition mrt-rrc10-6
> {noformat}
> After which consumer makes progress:
> {noformat}
> 2020-03-15 12:08:58,871 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Skipping validation of fetch offsets for partitions [mrt-rrc10-1, 
> mrt-rrc10-6, mrt-rrc22-7] since the broker does not support the required 
> protocol version (introduced in Kafka 2.3)

[jira] [Commented] (KAFKA-9691) Flaky test kafka.admin.TopicCommandWithAdminClientTest#testDescribeUnderReplicatedPartitionsWhenReassignmentIsInProgress

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062756#comment-17062756
 ] 

ASF GitHub Bot commented on KAFKA-9691:
---

tombentley commented on pull request #8317: KAFKA-9691: Fix NPE by waiting for 
reassignment request
URL: https://github.com/apache/kafka/pull/8317
 
 
   It seems likely the NPE reported in KAFKA-9691 was due to the call to 
`alterPartitionReassignments()` returning but the reassignment request not 
being completed yet, so try to fix it by calling `get()` on the returned 
`KafkaFuture`.
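The race and the fix can be illustrated with a stdlib `CompletableFuture` standing in for the admin client's `KafkaFuture`; the `alterReassignments` method below is a hypothetical stand-in, not the real admin API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

public class BlockOnFuture {
    // Hypothetical stand-in for an async admin call: the request is only
    // applied on another thread some time after the method returns.
    static CompletableFuture<Void> alterReassignments(AtomicBoolean applied) {
        return CompletableFuture.runAsync(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            applied.set(true);
        });
    }

    public static void main(String[] args) throws Exception {
        AtomicBoolean applied = new AtomicBoolean(false);
        CompletableFuture<Void> future = alterReassignments(applied);
        // Without this get(), the next step could observe pre-reassignment
        // state (the source of the flaky NPE); blocking makes it deterministic.
        future.get();
        System.out.println("applied=" + applied.get());
    }
}
```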
 



> Flaky test 
> kafka.admin.TopicCommandWithAdminClientTest#testDescribeUnderReplicatedPartitionsWhenReassignmentIsInProgress
> 
>
> Key: KAFKA-9691
> URL: https://issues.apache.org/jira/browse/KAFKA-9691
> Project: Kafka
>  Issue Type: Bug
>  Components: unit tests
>Affects Versions: 2.5.0
>Reporter: Tom Bentley
>Priority: Major
>  Labels: flaky-test
>
> Stacktrace:
> {noformat}
> java.lang.NullPointerException
>   at 
> kafka.admin.TopicCommandWithAdminClientTest.$anonfun$testDescribeUnderReplicatedPartitionsWhenReassignmentIsInProgress$3(TopicCommandWithAdminClientTest.scala:673)
>   at 
> kafka.admin.TopicCommandWithAdminClientTest.testDescribeUnderReplicatedPartitionsWhenReassignmentIsInProgress(TopicCommandWithAdminClientTest.scala:671)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>   at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
>   at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
>   at 
> org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62)
>   at 
> org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base

[jira] [Commented] (KAFKA-9724) Consumer wrongly ignores fetched records "since it no longer has valid position"

2020-03-19 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062748#comment-17062748
 ] 

Ismael Juma commented on KAFKA-9724:


cc [~hachikuji] [~bchen225242]

> Consumer wrongly ignores fetched records "since it no longer has valid 
> position"
> 
>
> Key: KAFKA-9724
> URL: https://issues.apache.org/jira/browse/KAFKA-9724
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Affects Versions: 2.4.0
>Reporter: Oleg Muravskiy
>Priority: Major
>
> After upgrading kafka-client to 2.4.0 (while brokers are still at 2.2.0) 
> consumers in a consumer group intermittently stop progressing on assigned 
> partitions, even when there are messages to consume. This is not a permanent 
> condition, as they progress from time to time, but become slower with time, 
> and catch up after restart.
> Here is a sample of 3 consecutive ignored fetches:
> {noformat}
> 2020-03-15 12:08:58,440 DEBUG [Thread-6] o.a.k.c.c.i.ConsumerCoordinator - 
> Committed offset 538065584 for partition mrt-rrc10-6
> 2020-03-15 12:08:58,541 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Skipping validation of fetch offsets for partitions [mrt-rrc10-1, 
> mrt-rrc10-6, mrt-rrc22-7] since the broker does not support the required 
> protocol version (introduced in Kafka 2.3)
> 2020-03-15 12:08:58,549 DEBUG [Thread-6] org.apache.kafka.clients.Metadata - 
> Updating last seen epoch from null to 62 for partition mrt-rrc10-6
> 2020-03-15 12:08:58,557 DEBUG [Thread-6] o.a.k.c.c.i.ConsumerCoordinator - 
> Committed offset 538065584 for partition mrt-rrc10-6
> 2020-03-15 12:08:58,652 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Skipping validation of fetch offsets for partitions [mrt-rrc10-1, 
> mrt-rrc10-6, mrt-rrc22-7] since the broker does not support the required 
> protocol version (introduced in Kafka 2.3)
> 2020-03-15 12:08:58,659 DEBUG [Thread-6] org.apache.kafka.clients.Metadata - 
> Updating last seen epoch from null to 62 for partition mrt-rrc10-6
> 2020-03-15 12:08:58,659 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Fetch READ_UNCOMMITTED at offset 538065584 for partition mrt-rrc10-6 returned 
> fetch data (error=NONE, highWaterMark=538065631, lastStableOffset = 
> 538065631, logStartOffset = 485284547, preferredReadReplica = absent, 
> abortedTransactions = null, recordsSizeInBytes=16380)
> 2020-03-15 12:08:58,659 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Ignoring fetched records for partition mrt-rrc10-6 since it no longer has 
> valid position
> 2020-03-15 12:08:58,665 DEBUG [Thread-6] o.a.k.c.c.i.ConsumerCoordinator - 
> Committed offset 538065584 for partition mrt-rrc10-6
> 2020-03-15 12:08:58,761 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Skipping validation of fetch offsets for partitions [mrt-rrc10-1, 
> mrt-rrc10-6, mrt-rrc22-7] since the broker does not support the required 
> protocol version (introduced in Kafka 2.3)
> 2020-03-15 12:08:58,761 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Added READ_UNCOMMITTED fetch request for partition mrt-rrc10-6 at position 
> FetchPosition{offset=538065584, offsetEpoch=Optional[62], 
> currentLeader=LeaderAndEpoch{leader=node03.kafka:9092 (id: 3 rack: null), 
> epoch=-1}} to node node03.kafka:9092 (id: 3 rack: null)
> 2020-03-15 12:08:58,761 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), 
> implied=(mrt-rrc10-6, mrt-rrc22-7, mrt-rrc10-1)) to broker node03.kafka:9092 
> (id: 3 rack: null)
> 2020-03-15 12:08:58,770 DEBUG [Thread-6] org.apache.kafka.clients.Metadata - 
> Updating last seen epoch from null to 62 for partition mrt-rrc10-6
> 2020-03-15 12:08:58,770 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Fetch READ_UNCOMMITTED at offset 538065584 for partition mrt-rrc10-6 returned 
> fetch data (error=NONE, highWaterMark=538065727, lastStableOffset = 
> 538065727, logStartOffset = 485284547, preferredReadReplica = absent, 
> abortedTransactions = null, recordsSizeInBytes=51864)
> 2020-03-15 12:08:58,770 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Ignoring fetched records for partition mrt-rrc10-6 since it no longer has 
> valid position
> 2020-03-15 12:08:58,808 DEBUG [Thread-6] o.a.k.c.c.i.ConsumerCoordinator - 
> Committed offset 538065584 for partition mrt-rrc10-6
> {noformat}
> After which consumer makes progress:
> {noformat}
> 2020-03-15 12:08:58,871 DEBUG [Thread-6] o.a.k.c.consumer.internals.Fetcher - 
> Skipping validation of fetch offsets for partitions [mrt-rrc10-1, 
> mrt-rrc10-6, mrt-rrc22-7] since the broker does not support the required 
> protocol version (introduced in Kafka 2.3)
> 2020-03-15 12:08:58

[jira] [Updated] (KAFKA-9583) OffsetsForLeaderEpoch requests are sometimes not sent to the leader of partition

2020-03-19 Thread Ismael Juma (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-9583:
---
Fix Version/s: 2.5.1
   2.6.0

> OffsetsForLeaderEpoch requests are sometimes not sent to the leader of 
> partition
> 
>
> Key: KAFKA-9583
> URL: https://issues.apache.org/jira/browse/KAFKA-9583
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.4.0, 2.3.1
>Reporter: Andy Fang
>Priority: Minor
>  Labels: newbie, patch, pull-request-available
> Fix For: 2.6.0, 2.5.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In 
> [{{validateOffsetsAsync}}|https://github.com/apache/kafka/blob/2.3.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L737],
>  we group the requests by leader node for efficiency. The list of 
> topic-partitions is grouped from {{partitionsToValidate}} (all partitions) 
> to {{node}} => [{{fetchPostitions}} (partitions by 
> node)|https://github.com/apache/kafka/blob/2.3.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L739].
> However, when actually sending the request with 
> {{OffsetsForLeaderEpochClient}}, we use 
> [{{partitionsToValidate}}|https://github.com/apache/kafka/blob/2.3.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L765],
>  which is the list of all topic-partitions passed into 
> {{validateOffsetsAsync}}. This results in extra partitions being included in 
> the request sent to brokers that are potentially not the leader for those 
> partitions.
> I have submitted a PR, 
> [https://github.com/apache/kafka/pull/8077|https://github.com/apache/kafka/pull/8077],
>  that fixes this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9723) Consumers ignore fetched records "since it no longer has valid position"

2020-03-19 Thread Ismael Juma (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma resolved KAFKA-9723.

Resolution: Duplicate

Duplicate of KAFKA-9724

> Consumers ignore fetched records "since it no longer has valid position"
> 
>
> Key: KAFKA-9723
> URL: https://issues.apache.org/jira/browse/KAFKA-9723
> Project: Kafka
>  Issue Type: Bug
>Reporter: Oleg Muravskiy
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9729) Shrink inWriteLock time in SimpleAuthorizer

2020-03-19 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062736#comment-17062736
 ] 

Ismael Juma commented on KAFKA-9729:


Thanks for the report. Related: [https://github.com/apache/kafka/pull/7882]

> Shrink inWriteLock time in SimpleAuthorizer
> ---
>
> Key: KAFKA-9729
> URL: https://issues.apache.org/jira/browse/KAFKA-9729
> Project: Kafka
>  Issue Type: Improvement
>  Components: security
>Affects Versions: 1.1.0
>Reporter: Jiao Zhang
>Priority: Minor
>
> The current SimpleAuthorizer takes 'inWriteLock' when processing add/remove acls 
> requests, while getAcls in authorize() takes 'inReadLock'.
>  That means handling add/remove acls requests blocks all other requests, 
> for example produce and fetch requests.
>  When processing add/remove acls, updateResourceAcls() accesses zk to update 
> acls, which can take a long time in cases such as a network glitch.
>  We simulated zk delays:
>  With a 100ms delay added on the zk side, 'inWriteLock' in addAcls()/removeAcls() 
> lasts for 400ms~500ms.
>  With a 500ms delay added on the zk side, 'inWriteLock' in addAcls()/removeAcls() 
> lasts for 2000ms~2500ms.
> {code:java}
> override def addAcls(acls: Set[Acl], resource: Resource) {
>   if (acls != null && acls.nonEmpty) {
>     inWriteLock(lock) {
>       val startMs = Time.SYSTEM.milliseconds()
>       updateResourceAcls(resource) { currentAcls =>
>         currentAcls ++ acls
>       }
>       warn(s"inWriteLock in addAcls consumes ${Time.SYSTEM.milliseconds() - startMs} milliseconds.")
>     }
>   }
> }{code}
> Blocking produce/fetch requests for 2s causes apparent performance 
> degradation for the whole cluster.
>  So we are considering whether it is possible to remove 'inWriteLock' from 
> addAcls/removeAcls and only take 'inWriteLock' inside updateCache, which is 
> called by addAcls/removeAcls.
> {code:java}
> private def updateCache(resource: Resource, versionedAcls: VersionedAcls) {
>   if (versionedAcls.acls.nonEmpty) {
>     aclCache.put(resource, versionedAcls)
>   } else {
>     aclCache.remove(resource)
>   }
> }
> {code}
> If we do this, the blocking time is only the time needed to update the local 
> cache, which isn't influenced by network glitches. But we don't know whether 
> there were special concerns behind the current strict write lock, or whether 
> there are side effects if the lock is only taken inside updateCache.
> Btw, the latest version uses 'inWriteLock' in the same places as version 1.1.0.
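The proposal above can be sketched as follows: perform the slow external (ZooKeeper) update outside the lock and take the write lock only while swapping the result into the local cache. This is a minimal, hypothetical sketch, not the actual SimpleAuthorizer code, and it deliberately ignores the race between concurrent add/remove on the same resource, which is presumably the concern the reporter raises:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: narrow the write-lock critical section to the local cache update
// so that reads (the authorize() path) are not blocked by network latency.
class AclCacheSketch {
    private final Map<String, Set<String>> aclCache = new ConcurrentHashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // authorize() path: short read-lock section.
    Set<String> getAcls(String resource) {
        lock.readLock().lock();
        try {
            return aclCache.getOrDefault(resource, Set.of());
        } finally {
            lock.readLock().unlock();
        }
    }

    void addAcls(String resource, Set<String> acls) {
        // Slow path (stand-in for the ZooKeeper round trip in
        // updateResourceAcls) runs WITHOUT the write lock.
        Set<String> merged = slowExternalUpdate(resource, acls);
        updateCache(resource, merged); // only this takes the write lock
    }

    private void updateCache(String resource, Set<String> versionedAcls) {
        lock.writeLock().lock();
        try {
            if (!versionedAcls.isEmpty()) aclCache.put(resource, versionedAcls);
            else aclCache.remove(resource);
        } finally {
            lock.writeLock().unlock();
        }
    }

    private Set<String> slowExternalUpdate(String resource, Set<String> acls) {
        return acls; // placeholder for the external store update
    }
}
```

The trade-off is exactly the open question in the ticket: without the broad write lock, two concurrent updates to the same resource may interleave their external writes, so some per-resource serialization would still be needed.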



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9677) Low consume bandwidth quota may cause consumer not being able to fetch data

2020-03-19 Thread Ismael Juma (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma resolved KAFKA-9677.

Fix Version/s: 2.6.0
   Resolution: Fixed

> Low consume bandwidth quota may cause consumer not being able to fetch data
> ---
>
> Key: KAFKA-9677
> URL: https://issues.apache.org/jira/browse/KAFKA-9677
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.1, 2.1.1, 2.2.2, 2.4.0, 2.3.1
>Reporter: Anna Povzner
>Assignee: Anna Povzner
>Priority: Major
> Fix For: 2.6.0
>
>
> When we changed quota communication with KIP-219, fetch requests get 
> throttled by returning empty response with the delay in `throttle_time_ms` 
> and Kafka consumer retrying again after the delay. 
> With default configs, the maximum fetch size could be as big as 50MB (or 10MB 
> per partition). The default broker config (1-second window, 10 full windows 
> of tracked bandwidth/thread utilization usage) means that < 5MB/s consumer 
> quota (per broker) may stop fetch request from ever being successful.
> Or the other way around: a 1 MB/s consumer quota (per broker) means that any 
> fetch request that would return >= 10MB of data (10 seconds * 1MB/second) 
> will never get through. From the consumer's point of view, the behavior 
> is: the consumer gets an empty response with throttle_time_ms > 0, waits 
> for the throttle delay, then sends the fetch request again; the fetch 
> response is still too big, so the broker sends another empty response 
> with a throttle time, and so on in a never-ending loop.
> h3. Proposed fix
> Return less data in the fetch response in this case: cap `fetchMaxBytes` passed 
> to replicaManager.fetchMessages() from KafkaApis.handleFetchRequest() to the 
> consumer bandwidth quota multiplied by the quota window (number of samples * 
> sample window size). In the example of default 
> configs and a 1MB/s consumer bandwidth quota, fetchMaxBytes will be 10MB.
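The proposed cap can be sketched in a few lines. This is illustrative only, not the actual patch; the method and parameter names are hypothetical, but the arithmetic matches the example above (10 one-second windows, 1 MB/s quota, 50 MB default max fetch):

```java
public class FetchQuotaCap {
    // Cap the fetch size at what the quota can "pay for" over the full
    // tracked window, i.e. quota rate * (number of samples * sample window),
    // so a fetch response can always eventually get through.
    static long capFetchMaxBytes(long fetchMaxBytes,
                                 long quotaBytesPerSec,
                                 int numQuotaSamples,
                                 int quotaWindowSizeSeconds) {
        long maxPayable = quotaBytesPerSec * numQuotaSamples * quotaWindowSizeSeconds;
        return Math.min(fetchMaxBytes, maxPayable);
    }

    public static void main(String[] args) {
        // Defaults from the description: 50MB max fetch, 10 x 1s windows, 1MB/s quota.
        long capped = capFetchMaxBytes(50L << 20, 1L << 20, 10, 1);
        System.out.println(capped >> 20); // prints 10 (MB)
    }
}
```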



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9722) Kafka stops consuming messages after error

2020-03-19 Thread Artem Loginov (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Loginov resolved KAFKA-9722.
--
Resolution: Invalid

> Kafka stops consuming messages after error
> --
>
> Key: KAFKA-9722
> URL: https://issues.apache.org/jira/browse/KAFKA-9722
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.1
>Reporter: Artem Loginov
>Priority: Major
>
> Hello,
> I have an issue with Kafka in my app. It suddenly stops consuming new 
> messages and I don't know what to do.
>  I posted an issue on the spring-cloud-stream GitHub, but they kindly redirected 
> me to the Kafka project, saying this has nothing to do with Spring.
>  Here is the issue: 
> [https://github.com/spring-cloud/spring-cloud-stream/issues/1928#event-3127969583]
>  Please help, since my app only works for a couple of hours in the production env 
> and then just stops doing work until I restart it. I can provide any 
> additional configuration you need.
> Thank you for your help.
> UPD (_16 Mar 2020)_:
> Following [~bchen225242]'s comment, here is the summarized error description.
>  After some time I receive a lot of: 
> {code:java}
> Offset commit failed on partition topic.name-57 at offset 31282126: The 
> request timed out.
> {code}
> and a couple of 
> {code:java}
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
> not the leader for that topic-partition{code}
> after which my app stops consuming from Kafka topics, but continues to send 
> messages.
>  Regarding the config, I basically use 
> spring-cloud-starter-stream-kafka defaults with 
> these additional properties:
> {code:java}
> spring.cloud.stream.bindings.inputEvent.destination=topic-2.name
> spring.cloud.stream.bindings.inputEvent.group=${spring.application.name}
> spring.cloud.stream.bindings.outputEvent.destination=topic-2.name
> spring.cloud.stream.bindings.inputTopicName.destination=topic.name
> spring.cloud.stream.bindings.inputTopicName.group=${spring.application.name}
> spring.cloud.stream.bindings.outputTopicName.destination=topic.name
> spring.kafka.consumer.group-id=${spring.application.name}
> spring.kafka.consumer.auto-offset-reset=earliest
> spring.kafka.bootstrap-servers=kafka-bootstrap-servver:9092
> spring.cloud.stream.kafka.binder.producer-properties.key.serializer=org.apache.kafka.common.serialization.StringSerializer
> {code}
> I am honestly not sure if this is the config you are looking 
> for. Let me know if you need anything extra and I will do my best to provide 
> the required information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9734) Streams IllegalStateException in trunk during rebalance

2020-03-19 Thread John Roesler (Jira)
John Roesler created KAFKA-9734:
---

 Summary: Streams IllegalStateException in trunk during rebalance
 Key: KAFKA-9734
 URL: https://issues.apache.org/jira/browse/KAFKA-9734
 Project: Kafka
  Issue Type: Bug
  Components: streams
Reporter: John Roesler
Assignee: John Roesler


I have observed the following exception kill a thread in the current trunk 
of Streams:
{noformat}
[2020-03-19 07:04:35,206] ERROR 
[stream-soak-test-e60443b4-aa2d-4107-abf7-ce90407cb70e-StreamThread-1] 
stream-thread [stream-soak-test-e60443b4-aa
2d-4107-abf7-ce90407cb70e-StreamThread-1] Encountered the following exception 
during processing and the thread is going to shut down:  
(org.apache.kafka.streams.processor.internals.StreamThread)
java.lang.IllegalStateException: The changelog reader is not restoring active tasks while trying to transit to update standby tasks: {}
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.transitToUpdateStandby(StoreChangelogReader.java:303)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:582)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:498)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:472){noformat}
Judging from the fact that the standby tasks are reported as an empty set, I 
think this is just a missed case in the task manager. PR to follow shortly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9435) Replace DescribeLogDirs request/response with automated protocol

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062677#comment-17062677
 ] 

ASF GitHub Bot commented on KAFKA-9435:
---

mimaison commented on pull request #7972: KAFKA-9435: DescribeLogDirs automated 
protocol
URL: https://github.com/apache/kafka/pull/7972
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Replace DescribeLogDirs request/response with automated protocol
> 
>
> Key: KAFKA-9435
> URL: https://issues.apache.org/jira/browse/KAFKA-9435
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Tom Bentley
>Assignee: Tom Bentley
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9433) Replace AlterConfigs request/response with automated protocol

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062637#comment-17062637
 ] 

ASF GitHub Bot commented on KAFKA-9433:
---

tombentley commented on pull request #8315: KAFKA-9433: Use automated protocol 
for AlterConfigs request and response
URL: https://github.com/apache/kafka/pull/8315
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Replace AlterConfigs request/response with automated protocol
> -
>
> Key: KAFKA-9433
> URL: https://issues.apache.org/jira/browse/KAFKA-9433
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Tom Bentley
>Assignee: Tom Bentley
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8618) Replace WriteTxnMarkers request/response with automated protocol

2020-03-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062630#comment-17062630
 ] 

ASF GitHub Bot commented on KAFKA-8618:
---

mimaison commented on pull request #7039: KAFKA-8618: Replace Txn marker with 
automated protocol
URL: https://github.com/apache/kafka/pull/7039
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Replace WriteTxnMarkers request/response with automated protocol
> 
>
> Key: KAFKA-8618
> URL: https://issues.apache.org/jira/browse/KAFKA-8618
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Boyang Chen
>Assignee: Boyang Chen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-1518) KafkaMetricsReporter prevents Kafka from starting if the custom reporter throws an exception

2020-03-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/KAFKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sönke Liebau resolved KAFKA-1518.
-
Resolution: Not A Bug

Closed, as this seems to be working as designed, and due to long inactivity on the ticket.

> KafkaMetricsReporter prevents Kafka from starting if the custom reporter 
> throws an exception
> 
>
> Key: KAFKA-1518
> URL: https://issues.apache.org/jira/browse/KAFKA-1518
> Project: Kafka
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.8.1.1
>Reporter: Daniel Compton
>Priority: Major
>
> When Kafka starts up, it 
> [starts|https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/metrics/KafkaMetricsReporter.scala#L60]
>  custom metrics reporters. If any of these throw an exception on startup then 
> this will block Kafka from starting.
> For example using 
> [kafka-riemann-reporter|https://github.com/pingles/kafka-riemann-reporter], 
> if Riemann is not available then it will throw an [IO 
> Exception|https://github.com/pingles/kafka-riemann-reporter/issues/1] which 
> isn't caught by KafkaMetricsReporter. This means that Kafka will fail to 
> start until it can connect to Riemann, coupling a HA system to a non HA 
> system.
> It would probably be better to log the error and perhaps have a callback hook 
> to the reporter where it can handle the error and start up in the background.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-1440) Per-request tracing

2020-03-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/KAFKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sönke Liebau resolved KAFKA-1440.
-
Resolution: Won't Fix

Closed due to long inactivity and no comment upon inquiry.

> Per-request tracing
> ---
>
> Key: KAFKA-1440
> URL: https://issues.apache.org/jira/browse/KAFKA-1440
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Todd Palino
>Priority: Major
>
> Could we have a flag in requests (potentially in the header for all requests, 
> but at least for produce and fetch) that would enable tracing for that 
> request. Currently, if we want to debug an issue with a request, we need to 
> turn on trace level logging for the entire broker. If the client could ask 
> for the request to be traced, we could then log detailed information for just 
> that request.
> Ideally, this would be available as a flag that can be enabled in the client 
> via JMX, so it could be done without needing to restart the client 
> application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-1336) Create partition compaction analyzer

2020-03-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/KAFKA-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sönke Liebau resolved KAFKA-1336.
-
Resolution: Won't Fix

Closed due to long inactivity and no comment upon inquiry.

> Create partition compaction analyzer
> 
>
> Key: KAFKA-1336
> URL: https://issues.apache.org/jira/browse/KAFKA-1336
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Jay Kreps
>Priority: Major
>
> It would be nice to have a tool that given a topic and partition reads the 
> full data and outputs:
> 1. The percentage of records that are duplicates
> 2. The percentage of records that have been deleted



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-1231) Support partition shrink (delete partition)

2020-03-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/KAFKA-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sönke Liebau resolved KAFKA-1231.
-
Resolution: Won't Fix

Closed due to doubts regarding the feasibility of this ticket and long 
inactivity.

> Support partition shrink (delete partition)
> ---
>
> Key: KAFKA-1231
> URL: https://issues.apache.org/jira/browse/KAFKA-1231
> Project: Kafka
>  Issue Type: Wish
>  Components: core
>Affects Versions: 0.8.0
>Reporter: Marc Labbe
>Priority: Minor
>
> Topics may have to be sized down once its peak usage has been passed. It 
> would be interesting to be able to reduce the number of partitions for a 
> topic.
> In reference to http://search-hadoop.com/m/4TaT4hQiGt1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-1265) SBT and Gradle create jars without expected Maven files

2020-03-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/KAFKA-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sönke Liebau resolved KAFKA-1265.
-
Resolution: Not A Problem

Closed as this seems to be covered by current functionality. Also long 
inactivity on this ticket and no comment after inquiry.

> SBT and Gradle create jars without expected Maven files
> ---
>
> Key: KAFKA-1265
> URL: https://issues.apache.org/jira/browse/KAFKA-1265
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.0, 0.8.1
>Reporter: Clark Breyman
>Priority: Minor
>  Labels: build
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The jar files produced and deployed to maven central do not embed the 
> expected Maven pom.xml and pom.properties files as would be expected by a 
> standard Maven-build artifact. This results in jars that do not self-document 
> their versions or dependencies. 
> For reference to the maven behavior, see addMavenDescriptor (defaults to 
> true): http://maven.apache.org/shared/maven-archiver/#archive
> Worst case, these files would need to be generated by the build and included 
> in the jar file. Gradle doesn't seem to have an automatic mechanism for this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-3919) Broker fails to start after ungraceful shutdown due to non-monotonically incrementing offsets in logs

2020-03-19 Thread zhangchenghui (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062331#comment-17062331
 ] 

zhangchenghui edited comment on KAFKA-3919 at 3/19/20, 6:59 AM:


Hi, [~junrao]
When a Kafka partition is unavailable and the leader replica is damaged, I try 
to minimize data loss by doing the following:
1. Reassign the unavailable partitions;
2. Modify the zk node: /brokers/topics/test-topic/partitions/0/state
3. Restart the broker
When a partition is unavailable, could we provide an option for users to manually 
set any replica in the partition as the leader?
Because once the cluster has set unclean.leader.election.enable = false, it is 
not possible to elect a replica outside the ISR as the leader. In extreme cases, 
only the leader replica is still in the ISR; if the broker hosting the leader then 
goes down and its data is corrupted, is it possible for the user to choose the 
leader replica himself? Although this would also result in data loss, the 
situation would be much better than losing the entire partition.
This is my summary:
https://mp.weixin.qq.com/s/b1etPGC97xNjmgdQbycnSg


was (Author: zhangchenghui):
Hi, [~junrao]
When the Kafka partition is unavailable and the leader copy is damaged, I try 
to minimize data loss by doing the following:
1. Partition reallocation of unavailable partitions;
2. Modify zk node: / brokers / topics / test-topic / partitions / 0 / state
3.Restart the broker
When a partition is unavailable, can I provide an option for users to manually 
set any copy in the partition as the leader?
Because once the cluster has set unclean.leader.election.enable = false, it is 
not possible to elect a copy other than ISR as the leader. In extreme cases, 
only the leader copy is still in the ISR. At this time, the broker where the 
leader is located is down. What if the broker data is corrupted? In this case, 
is it possible for the user to choose the leader replica himself? Although this 
will also result in data loss, the situation will be much better than data loss 
for the entire partition.

> Broker fails to start after ungraceful shutdown due to non-monotonically 
> incrementing offsets in logs
> --
>
> Key: KAFKA-3919
> URL: https://issues.apache.org/jira/browse/KAFKA-3919
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.9.0.1
>Reporter: Andy Coates
>Assignee: Ben Stopford
>Priority: Major
>  Labels: reliability
> Fix For: 0.11.0.0
>
>
> Hi All,
> I encountered an issue with Kafka following a power outage that saw a 
> proportion of our cluster disappear. When the power came back on several 
> brokers halted on start up with the error:
> {noformat}
>   Fatal error during KafkaServerStartable startup. Prepare to shutdown”
>   kafka.common.InvalidOffsetException: Attempt to append an offset 
> (1239742691) to position 35728 no larger than the last offset appended 
> (1239742822) to 
> /data3/kafka/mt_xp_its_music_main_itsevent-20/001239444214.index.
>   at 
> kafka.log.OffsetIndex$$anonfun$append$1.apply$mcV$sp(OffsetIndex.scala:207)
>   at kafka.log.OffsetIndex$$anonfun$append$1.apply(OffsetIndex.scala:197)
>   at kafka.log.OffsetIndex$$anonfun$append$1.apply(OffsetIndex.scala:197)
>   at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>   at kafka.log.OffsetIndex.append(OffsetIndex.scala:197)
>   at kafka.log.LogSegment.recover(LogSegment.scala:188)
>   at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:188)
>   at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:160)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at kafka.log.Log.loadSegments(Log.scala:160)
>   at kafka.log.Log.<init>(Log.scala:90)
>   at 
> kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$3$$anonfun$apply$10$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:150)
>   at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:60)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang

[jira] [Commented] (KAFKA-3919) Broker fails to start after ungraceful shutdown due to non-monotonically incrementing offsets in logs

2020-03-19 Thread zhangchenghui (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062331#comment-17062331
 ] 

zhangchenghui commented on KAFKA-3919:
--

Hi, [~junrao]
When a Kafka partition is unavailable and the leader replica is damaged, I try 
to minimize data loss by doing the following:
1. Reassign the unavailable partitions;
2. Modify the zk node: /brokers/topics/test-topic/partitions/0/state
3. Restart the broker
When a partition is unavailable, could we provide an option for users to manually 
set any replica in the partition as the leader?
Because once the cluster has set unclean.leader.election.enable = false, it is 
not possible to elect a replica outside the ISR as the leader. In extreme cases, 
only the leader replica is still in the ISR; if the broker hosting the leader then 
goes down and its data is corrupted, is it possible for the user to choose the 
leader replica himself? Although this would also result in data loss, the 
situation would be much better than losing the entire partition.

> Broker fails to start after ungraceful shutdown due to non-monotonically 
> incrementing offsets in logs
> --
>
> Key: KAFKA-3919
> URL: https://issues.apache.org/jira/browse/KAFKA-3919
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.9.0.1
>Reporter: Andy Coates
>Assignee: Ben Stopford
>Priority: Major
>  Labels: reliability
> Fix For: 0.11.0.0
>
>
> Hi All,
> I encountered an issue with Kafka following a power outage that saw a 
> proportion of our cluster disappear. When the power came back on several 
> brokers halted on start up with the error:
> {noformat}
>   Fatal error during KafkaServerStartable startup. Prepare to shutdown”
>   kafka.common.InvalidOffsetException: Attempt to append an offset 
> (1239742691) to position 35728 no larger than the last offset appended 
> (1239742822) to 
> /data3/kafka/mt_xp_its_music_main_itsevent-20/001239444214.index.
>   at 
> kafka.log.OffsetIndex$$anonfun$append$1.apply$mcV$sp(OffsetIndex.scala:207)
>   at kafka.log.OffsetIndex$$anonfun$append$1.apply(OffsetIndex.scala:197)
>   at kafka.log.OffsetIndex$$anonfun$append$1.apply(OffsetIndex.scala:197)
>   at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>   at kafka.log.OffsetIndex.append(OffsetIndex.scala:197)
>   at kafka.log.LogSegment.recover(LogSegment.scala:188)
>   at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:188)
>   at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:160)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at kafka.log.Log.loadSegments(Log.scala:160)
>   at kafka.log.Log.<init>(Log.scala:90)
>   at 
> kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$3$$anonfun$apply$10$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:150)
>   at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:60)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The only way to recover the brokers was to delete the log files that 
> contained non-monotonically incrementing offsets.
> We've spent some time digging through the logs and I feel I may have worked 
> out the sequence of events leading to this issue, (though this is based on 
> some assumptions I've made about the way Kafka is working, which may be 
> wrong).
> First off, we have unclean leadership elections disabled. (We did later enable 
> them to help get around some other issues we were having, but this was 
> several hours after this issue manifested), and we're producing to the topic 
> with gzip compression and acks=1.
> We looked through the data logs that were causing the brokers to not start. 
> What we found was that the initial part of the log has monotonically increasing 
> offsets, where each compressed batch normally contained one or two records. 
> Then there is a batch that contains many records, whose first records have an 
> offset below the previous batch and whose last record has an offset above the 
>
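The invariant that recovery trips over in the trace above can be sketched as follows. This is a simplified, illustrative model, not the actual Kafka OffsetIndex code: each appended offset must be strictly larger than the last one, so a batch whose offsets go backwards aborts segment recovery:

```java
// Minimal sketch of the OffsetIndex.append monotonicity check.
public class OffsetIndexSketch {
    private long lastOffset = -1L; // no entries yet

    public void append(long offset) {
        if (lastOffset >= 0 && offset <= lastOffset) {
            // Mirrors the failure mode in the stack trace above.
            throw new IllegalStateException(
                "Attempt to append an offset (" + offset
                + ") no larger than the last offset appended (" + lastOffset + ")");
        }
        lastOffset = offset; // a real index would also record the file position
    }

    public static void main(String[] args) {
        OffsetIndexSketch idx = new OffsetIndexSketch();
        idx.append(1239742691L);     // ok: first entry
        idx.append(1239742822L);     // ok: strictly increasing
        try {
            idx.append(1239742691L); // rejected: not larger than the last offset
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```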