[jira] [Commented] (KAFKA-9225) kafka fail to run on linux-aarch64
[ https://issues.apache.org/jira/browse/KAFKA-9225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057644#comment-17057644 ] ASF GitHub Bot commented on KAFKA-9225: --- jiameixie commented on pull request #8284: KAFKA-9225: rocksdb 5.18.3 to 5.18.4 URL: https://github.com/apache/kafka/pull/8284 Bump RocksDB from 5.18.3 to 5.18.4, which supports all platforms. Related RocksDB issues for this version are https://github.com/facebook/rocksdb/pull/6497 and https://github.com/facebook/rocksdb/issues/6188 Change-Id: I3febec8e36550edcb7f88839cc1e2b2a54984564 Signed-off-by: Jiamei Xie ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > kafka fail to run on linux-aarch64 > -- > > Key: KAFKA-9225 > URL: https://issues.apache.org/jira/browse/KAFKA-9225 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: jiamei xie >Priority: Blocker > Labels: incompatible > Fix For: 3.0.0 > > Attachments: compat_report.html > > > *Steps to reproduce:* > 1. Download the latest Kafka source code > 2. Build it with Gradle > 3. 
Run > [streamDemo|[https://kafka.apache.org/23/documentation/streams/quickstart]] > when running bin/kafka-run-class.sh > org.apache.kafka.streams.examples.wordcount.WordCountDemo, it crashed with > the following error message > {code:java} > xjm@ubuntu-arm01:~/kafka$bin/kafka-run-class.sh > org.apache.kafka.streams.examples.wordcount.WordCountDemo > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/core/build/dependant-libs-2.12.10/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/tools/build/dependant-libs-2.12.10/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/api/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/transforms/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/runtime/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/file/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/mirror/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/mirror-client/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/json/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/basic-auth-extension/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: 
See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > [2019-11-19 15:42:23,277] WARN The configuration 'admin.retries' was supplied > but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig) > [2019-11-19 15:42:23,278] WARN The configuration 'admin.retry.backoff.ms' was > supplied but isn't a known config. > (org.apache.kafka.clients.consumer.ConsumerConfig) > [2019-11-19 15:42:24,278] ERROR stream-client > [streams-wordcount-0f3cf88b-e2c4-4fb6-b7a3-9754fad5cd48] All stream threads > have died. The instance will be in error state and should be closed. > (org.apach e.kafka.streams.KafkaStreams) > Exception in thread > "streams-wordcount-0f3cf88b-e2c4-4fb6-b7a3-9754f
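The fix for the crash above was a dependency bump rather than a code change. A minimal sketch of applying it from a Kafka source checkout; both the file path gradle/dependencies.gradle and the `rocksDB` key name are assumptions about the build layout and may differ across Kafka branches:

```shell
# Hedged sketch: bump the RocksDB version pinned in the Gradle build.
# The path gradle/dependencies.gradle and the key name `rocksDB` are
# assumptions, not verified facts about every Kafka branch.
if [ -f gradle/dependencies.gradle ]; then
  sed -i.bak 's/rocksDB: "5.18.3"/rocksDB: "5.18.4"/' gradle/dependencies.gradle
  grep 'rocksDB' gradle/dependencies.gradle   # confirm the bumped version
fi
```

After the bump, rebuilding with Gradle and re-running the WordCountDemo on an aarch64 host should no longer hit the crash, since 5.18.4 ships aarch64 native libraries.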
[jira] [Commented] (KAFKA-7983) supporting replication.throttled.replicas in dynamic broker configuration
[ https://issues.apache.org/jira/browse/KAFKA-7983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057634#comment-17057634 ] Cheng Tan commented on KAFKA-7983: -- [https://github.com/apache/kafka/pull/8283] > supporting replication.throttled.replicas in dynamic broker configuration > - > > Key: KAFKA-7983 > URL: https://issues.apache.org/jira/browse/KAFKA-7983 > Project: Kafka > Issue Type: New Feature > Components: core >Reporter: Jun Rao >Assignee: Cheng Tan >Priority: Major > > In > [KIP-226|https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration#KIP-226-DynamicBrokerConfiguration-DefaultTopicconfigs], > we added support to change broker defaults dynamically. However, it > didn't support changing leader.replication.throttled.replicas and > follower.replication.throttled.replicas. These two configs were introduced in > [KIP-73|https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas] > and control the set of topic partitions on which replication throttling > will be engaged. One useful case is to be able to set a default value for > both configs to * to allow throttling to be engaged for all topic partitions. > Currently, the static default values for both configs are ignored for > replication throttling; it would be useful to fix that as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-7983) supporting replication.throttled.replicas in dynamic broker configuration
[ https://issues.apache.org/jira/browse/KAFKA-7983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057633#comment-17057633 ] ASF GitHub Bot commented on KAFKA-7983: --- d8tltanc commented on pull request #8283: [WIP] KAFKA-7983: supporting replication.throttled.replicas in dynamic broker configuration URL: https://github.com/apache/kafka/pull/8283 **More detailed description of your change** > In KIP-226, we added support to change broker defaults dynamically. However, it didn't support changing leader.replication.throttled.replicas and follower.replication.throttled.replicas. These two configs were introduced in KIP-73 and control the set of topic partitions on which replication throttling will be engaged. One useful case is to be able to set a default value for both configs to * to allow throttling to be engaged for all topic partitions. Currently, the static default values for both configs are ignored for replication throttling; it would be useful to fix that as well. > leader.replication.throttled.replicas and follower.replication.throttled.replicas are dynamically set through ReplicationQuotaManager.markThrottled() at the topic level. However, these two properties don't exist in the broker-level config, and BrokerConfigHandler doesn't call ReplicationQuotaManager.markThrottled(). So, currently, we can't set leader.replication.throttled.replicas and follower.replication.throttled.replicas at the broker level, either statically or dynamically. In this patch, we introduce two new dynamic broker configs, both of type boolean: "leader.replication.throttled" (default: false) and "follower.replication.throttled" (default: false). If "leader.replication.throttled" is set to "true", all leader brokers will be throttled. Similarly, if "follower.replication.throttled" is set to "true", all follower brokers will be throttled. The throttle mechanism is introduced in KIP-73. 
To implement the broker-level throttle, I added a new class variable to ReplicationQuotaManager. The BrokerConfigHandler will call updateBrokerThrottle() to update this class variable upon receiving the config change notification from ZooKeeper, and ReplicationQuotaManager::isThrottled() then takes it into account. **Summary of testing strategy (including rationale)** Added ReplicationQuotaManagerTest::shouldBrokerLevelThrottleAffectAllTopicPartition() to test that all topic partitions will be throttled when the broker is throttled. I'm currently working on adding more tests. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) > supporting replication.throttled.replicas in dynamic broker configuration > - > > Key: KAFKA-7983 > URL: https://issues.apache.org/jira/browse/KAFKA-7983 > Project: Kafka > Issue Type: New Feature > Components: core >Reporter: Jun Rao >Assignee: Cheng Tan >Priority: Major > > In > [KIP-226|https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration#KIP-226-DynamicBrokerConfiguration-DefaultTopicconfigs], > we added support to change broker defaults dynamically. However, it > didn't support changing leader.replication.throttled.replicas and > follower.replication.throttled.replicas. These two configs were introduced in > [KIP-73|https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas] > and control the set of topic partitions on which replication throttling > will be engaged. One useful case is to be able to set a default value for > both configs to * to allow throttling to be engaged for all topic partitions. 
> Currently, the static default values for both configs are ignored for > replication throttling; it would be useful to fix that as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
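For context, the topic-level mechanism from KIP-73 that the patch builds on can already be driven through kafka-configs.sh. A hedged sketch: "my-topic" is a hypothetical topic name, the exact flags (e.g. --zookeeper vs --bootstrap-server) vary by Kafka version, and the command is only printed here rather than executed against a live cluster:

```shell
# Topic-level throttled-replica config per KIP-73 (already supported today);
# the broker-level default discussed in this PR is what does not exist yet.
# "my-topic" is hypothetical; flags vary across Kafka versions.
TOPIC="my-topic"
CMD="bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name ${TOPIC} \
  --add-config leader.replication.throttled.replicas=*,follower.replication.throttled.replicas=*"
echo "$CMD"   # dry run: print the command instead of running it
```

Here `*` throttles all replicas of the one topic; the PR above aims for an equivalent switch at the broker level.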
[jira] [Commented] (KAFKA-9709) add a confirm when kafka-server-stop.sh find multiple kafka instances to kill
[ https://issues.apache.org/jira/browse/KAFKA-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057519#comment-17057519 ] qiang Liu commented on KAFKA-9709: -- Created a pull request [https://github.com/apache/kafka/pull/8275] which simply adds a confirmation. But as @rondagostino pointed out in that pull request, this changes existing behavior, so maybe we should do this in a better way. Some alternatives, listed below: # add a delay of, say, 5 seconds; if there is no user interaction, kill all just as before, and perhaps add an arg like killall to skip the wait # create another script, named something like kafka-server-stop-single.sh, which will not kill anything when it finds multiple instances > add a confirm when kafka-server-stop.sh find multiple kafka instances to kill > - > > Key: KAFKA-9709 > URL: https://issues.apache.org/jira/browse/KAFKA-9709 > Project: Kafka > Issue Type: Improvement > Components: admin >Affects Versions: 2.4.1 >Reporter: qiang Liu >Priority: Minor > > Currently kafka-server-stop.sh finds all Kafka instances on the machine and > kills them all without any confirmation; when deploying multiple instances on > one machine, one may mistakenly kill all instances while only meaning to kill a > single one.
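Idea 1 above can be sketched in shell. This is an illustration, not the contents of the actual pull request: the confirm_kill helper, the 5-second timeout, and the KILLALL=1 escape hatch are all illustrative assumptions, and the PID-discovery line only roughly mirrors what kafka-server-stop.sh does:

```shell
#!/usr/bin/env bash
# Illustrative sketch of "confirm before killing multiple instances".
# confirm_kill, the 5s timeout, and the KILLALL=1 override are assumptions,
# not the contents of PR 8275.
confirm_kill() {
  local pids=("$@")
  if [ "${#pids[@]}" -gt 1 ] && [ "${KILLALL:-0}" != "1" ]; then
    echo "Found ${#pids[@]} Kafka instances: ${pids[*]}"
    local answer=""
    # On timeout (no interaction), fall back to the old kill-all behavior.
    read -t 5 -r -p "Kill all of them? [y/N] " answer || answer=y
    if [ "$answer" != "y" ]; then echo "Aborted."; return 1; fi
  fi
  kill -s TERM "${pids[@]}"
}

# Roughly how kafka-server-stop.sh locates broker processes.
PIDS=$(ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}')
if [ -n "$PIDS" ]; then confirm_kill $PIDS; fi
```

With a single instance found, behavior is unchanged; with several, the operator gets one chance to abort, and scripted callers can set KILLALL=1 to keep today's semantics.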
[jira] [Created] (KAFKA-9709) add a confirm when kafka-server-stop.sh find multiple kafka instances to kill
qiang Liu created KAFKA-9709: Summary: add a confirm when kafka-server-stop.sh find multiple kafka instances to kill Key: KAFKA-9709 URL: https://issues.apache.org/jira/browse/KAFKA-9709 Project: Kafka Issue Type: Improvement Components: admin Affects Versions: 2.4.1 Reporter: qiang Liu Currently kafka-server-stop.sh finds all Kafka instances on the machine and kills them all without any confirmation; when deploying multiple instances on one machine, one may mistakenly kill all instances while only meaning to kill a single one.
[jira] [Commented] (KAFKA-5972) Flatten SMT does not work with null values
[ https://issues.apache.org/jira/browse/KAFKA-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057468#comment-17057468 ] ASF GitHub Bot commented on KAFKA-5972: --- rhauch commented on pull request #4021: KAFKA-5972 Flatten SMT does not work with null values URL: https://github.com/apache/kafka/pull/4021 > Flatten SMT does not work with null values > -- > > Key: KAFKA-5972 > URL: https://issues.apache.org/jira/browse/KAFKA-5972 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Tomas Zuklys >Assignee: siva santhalingam >Priority: Minor > Labels: easyfix, patch > Fix For: 1.0.3, 1.1.2, 2.0.2, 2.1.2, 2.2.2, 2.4.0, 2.3.1 > > Attachments: kafka-transforms.patch > > > Hi, > I noticed a bug in Flatten SMT while doing tests with different SMTs that are > provided out of the box. > Flatten SMT does not work as expected with schemaless JSON that has > properties with null values. > Example json: > {code} > {A={D=dValue, B=null, C=cValue}} > {code} > The issue is in the if statement that checks for a null value. > Current version: > {code} > for (Map.Entry<String, Object> entry : originalRecord.entrySet()) { > final String fieldName = fieldName(fieldNamePrefix, > entry.getKey()); > Object value = entry.getValue(); > if (value == null) { > newRecord.put(fieldName(fieldNamePrefix, entry.getKey()), > null); > return; > } > ... 
> {code} > should be > {code} > for (Map.Entry<String, Object> entry : originalRecord.entrySet()) { > final String fieldName = fieldName(fieldNamePrefix, > entry.getKey()); > Object value = entry.getValue(); > if (value == null) { > newRecord.put(fieldName(fieldNamePrefix, entry.getKey()), > null); > continue; > } > {code} > I have attached a patch containing the fix for this issue.
[jira] [Commented] (KAFKA-9701) Consumer could catch InconsistentGroupProtocolException during rebalance
[ https://issues.apache.org/jira/browse/KAFKA-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057459#comment-17057459 ] ASF GitHub Bot commented on KAFKA-9701: --- guozhangwang commented on pull request #8272: KAFKA-9701: Add more debug log on client to reproduce the issue URL: https://github.com/apache/kafka/pull/8272 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Consumer could catch InconsistentGroupProtocolException during rebalance > > > Key: KAFKA-9701 > URL: https://issues.apache.org/jira/browse/KAFKA-9701 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.5.0 >Reporter: Boyang Chen >Priority: Major > > INFO log shows that we accidentally hit an unexpected inconsistent group > protocol exception: > [2020-03-10T17:16:53-07:00] > (streams-soak-2-5-eos-broker-2-5_soak_i-00067445452c82fe8_streamslog) > [2020-03-11 *00:16:53,382*] INFO > [stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1] > stream-client [stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949] State > transition from REBALANCING to RUNNING (org.apache.kafka.streams.KafkaStreams) > > [2020-03-10T17:16:53-07:00] > (streams-soak-2-5-eos-broker-2-5_soak_i-00067445452c82fe8_streamslog) > [2020-03-11 *00:16:53,384*] WARN [kafka-producer-network-thread | > stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1-0_1-producer] > stream-thread > [stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1] task > [0_1] Error sending record to topic node-name-repartition due to Producer > attempted an operation with an old epoch. 
Either there is a newer producer > with the same transactionalId, or the producer's transaction has been expired > by the broker.; No more records will be sent and no more offsets will be > recorded for this task. > > > [2020-03-10T17:16:53-07:00] > (streams-soak-2-5-eos-broker-2-5_soak_i-00067445452c82fe8_streamslog) > [2020-03-11 *00:16:53,521*] INFO > [stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1] > [Consumer > clientId=stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1-consumer, > groupId=stream-soak-test] Member > stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1-consumer-d1c3c796-0bfb-4c1c-9fb4-5a807d8b53a2 > sending LeaveGroup request to coordinator > ip-172-31-20-215.us-west-2.compute.internal:9092 (id: 2147482646 rack: null) > due to the consumer unsubscribed from all topics > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > > [2020-03-10T17:16:54-07:00] > (streams-soak-2-5-eos-broker-2-5_soak_i-00067445452c82fe8_streamslog) > [2020-03-11 *00:16:53,798*] ERROR > [stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1] > stream-thread > [stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1] > Encountered the following unexpected Kafka exception during processing, this > usually indicate Streams internal errors: > (org.apache.kafka.streams.processor.internals.StreamThread) > [2020-03-10T17:16:54-07:00] > (streams-soak-2-5-eos-broker-2-5_soak_i-00067445452c82fe8_streamslog) > org.apache.kafka.common.errors.InconsistentGroupProtocolException: The group > member's supported protocols are incompatible with those of existing members > or first group member tried to join with empty protocol type or empty > protocol list. > > Potentially needs further log to understand this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-6145) Warm up new KS instances before migrating tasks - potentially a two phase rebalance
[ https://issues.apache.org/jira/browse/KAFKA-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057452#comment-17057452 ] ASF GitHub Bot commented on KAFKA-6145: --- ableegoldman commented on pull request #8282: KAFKA-6145: add new assignment configs URL: https://github.com/apache/kafka/pull/8282 For KIP-441 we intend to add 4 new configs: 1. assignment.acceptable.recovery.lag 2. assignment.balance.factor 3. assignment.max.extra.replicas 4. assignment.probing.rebalance.interval.ms I think we should give them all a common prefix to make it clear they're related, but am open to better suggestions than just "assignment" > Warm up new KS instances before migrating tasks - potentially a two phase > rebalance > --- > > Key: KAFKA-6145 > URL: https://issues.apache.org/jira/browse/KAFKA-6145 > Project: Kafka > Issue Type: New Feature > Components: streams >Reporter: Antony Stubbs >Assignee: Sophie Blee-Goldman >Priority: Major > Labels: needs-kip > > Currently when expanding the KS cluster, the new node's partitions will be > unavailable during the rebalance, which for large states can take a very long > time, or for small state stores even more than a few ms can be a deal breaker > for micro service use cases. > One workaround would be to execute the rebalance in two phases: > 1) start running state store building on the new node > 2) once the state store is fully populated on the new node, only then > rebalance the tasks - there will still be a rebalance pause, but it would be > greatly reduced > Relates to: KAFKA-6144 - Allow state stores to serve stale reads during > rebalance
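If the four proposed names shipped as-is, they would be set like any other Kafka Streams config, e.g. in a properties file. A sketch only: the names below are the proposals from the comment above (the final KIP-441 config names may differ) and the values are placeholders, not recommendations:

```shell
# Sketch only: config names are the KIP-441 proposals from the comment above
# and may not match what finally shipped; values are placeholders.
f=$(mktemp)
cat > "$f" <<'EOF'
assignment.acceptable.recovery.lag=10000
assignment.balance.factor=1
assignment.max.extra.replicas=2
assignment.probing.rebalance.interval.ms=600000
EOF
grep -c '^assignment\.' "$f"   # prints 4: all share the proposed "assignment." prefix
```

The shared `assignment.` prefix is the point being debated in the comment: it groups the related knobs without committing to any particular value.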
[jira] [Resolved] (KAFKA-9605) EOS Producer could throw illegal state if trying to complete a failed batch after fatal error
[ https://issues.apache.org/jira/browse/KAFKA-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Gustafson resolved KAFKA-9605. Resolution: Fixed > EOS Producer could throw illegal state if trying to complete a failed batch > after fatal error > - > > Key: KAFKA-9605 > URL: https://issues.apache.org/jira/browse/KAFKA-9605 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.0, 2.5.0 >Reporter: Boyang Chen >Assignee: Boyang Chen >Priority: Major > Fix For: 2.6.0 > > > In the Producer we could see network client hits fatal exception while trying > to complete the batches after Txn manager hits fatal fenced error: > {code:java} > > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,673] ERROR [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer, > transactionalId=stream-soak-test-1_0] Aborting producer batches due to fatal > error (org.apache.kafka.clients.producer.internals.Sender) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) > org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an > operation with an old epoch. Either there is a newer producer with the same > transactionalId, or the producer's transaction has been expired by the broker. > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,674] INFO > [stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-0_0-producer, > transactionalId=stream-soak-test-0_0] Closing the Kafka producer with > timeoutMillis = 9223372036854775807 ms. 
> (org.apache.kafka.clients.producer.KafkaProducer) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,684] INFO [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer, > transactionalId=stream-soak-test-1_0] Resetting sequence number of batch > with current sequence 354277 for partition windowed-node-counts-0 to 354276 > (org.apache.kafka.clients.producer.internals.TransactionManager) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,684] INFO [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > Resetting sequence number of batch with current sequence 354277 for > partition windowed-node-counts-0 to 354276 > (org.apache.kafka.clients.producer.internals.ProducerBatch) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,685] ERROR [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer, > transactionalId=stream-soak-test-1_0] Uncaught error in request completion: > (org.apache.kafka.clients.NetworkClient) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) > java.lang.IllegalStateException: Should not reopen a batch which is already > aborted. 
> at > org.apache.kafka.common.record.MemoryRecordsBuilder.reopenAndRewriteProducerState(MemoryRecordsBuilder.java:295) > at > org.apache.kafka.clients.producer.internals.ProducerBatch.resetProducerState(ProducerBatch.java:395) > at > org.apache.kafka.clients.producer.internals.TransactionManager.lambda$adjustSequencesDueToFailedBatch$4(TransactionManager.java:770) > at > org.apache.kafka.clients.producer.internals.TransactionManager$TopicPartitionEntry.resetSequenceNumbers(TransactionManager.java:180) > at > org.apache.kafka.clients.producer.internals.TransactionManager.adjustSequencesDueToFailedBatch(TransactionManager.java:760) > at > org.apache.kafka.clients.producer.internals.TransactionManager.handleFailedBatch(TransactionManager.java:735) > at > org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:671) > at > org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:662) > at > org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:620)
[jira] [Commented] (KAFKA-9605) EOS Producer could throw illegal state if trying to complete a failed batch after fatal error
[ https://issues.apache.org/jira/browse/KAFKA-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057443#comment-17057443 ] ASF GitHub Bot commented on KAFKA-9605: --- hachikuji commented on pull request #8177: KAFKA-9605: Do not attempt to abort batches when txn manager is in fatal error URL: https://github.com/apache/kafka/pull/8177 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > EOS Producer could throw illegal state if trying to complete a failed batch > after fatal error > - > > Key: KAFKA-9605 > URL: https://issues.apache.org/jira/browse/KAFKA-9605 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.0, 2.5.0 >Reporter: Boyang Chen >Assignee: Boyang Chen >Priority: Major > Fix For: 2.6.0 > > > In the Producer we could see network client hits fatal exception while trying > to complete the batches after Txn manager hits fatal fenced error: > {code:java} > > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,673] ERROR [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer, > transactionalId=stream-soak-test-1_0] Aborting producer batches due to fatal > error (org.apache.kafka.clients.producer.internals.Sender) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) > org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an > operation with an old epoch. Either there is a newer producer with the same > transactionalId, or the producer's transaction has been expired by the broker. 
> [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,674] INFO > [stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-0_0-producer, > transactionalId=stream-soak-test-0_0] Closing the Kafka producer with > timeoutMillis = 9223372036854775807 ms. > (org.apache.kafka.clients.producer.KafkaProducer) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,684] INFO [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer, > transactionalId=stream-soak-test-1_0] Resetting sequence number of batch > with current sequence 354277 for partition windowed-node-counts-0 to 354276 > (org.apache.kafka.clients.producer.internals.TransactionManager) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,684] INFO [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > Resetting sequence number of batch with current sequence 354277 for > partition windowed-node-counts-0 to 354276 > (org.apache.kafka.clients.producer.internals.ProducerBatch) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,685] ERROR [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer, > transactionalId=stream-soak-test-1_0] Uncaught error in request completion: > (org.apache.kafka.clients.NetworkClient) > [2020-02-24T13:23:29-08:00] > 
(streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) > java.lang.IllegalStateException: Should not reopen a batch which is already > aborted. > at > org.apache.kafka.common.record.MemoryRecordsBuilder.reopenAndRewriteProducerState(MemoryRecordsBuilder.java:295) > at > org.apache.kafka.clients.producer.internals.ProducerBatch.resetProducerState(ProducerBatch.java:395) > at > org.apache.kafka.clients.producer.internals.TransactionManager.lambda$adjustSequencesDueToFailedBatch$4(TransactionManager.java:770) > at > org.apache.kafka.clients.producer.internals.TransactionManager$TopicPartitionEntry.resetSequenceNumbers(TransactionManager.java:180) > at > org.a
[jira] [Commented] (KAFKA-7965) Flaky Test ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
[ https://issues.apache.org/jira/browse/KAFKA-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057442#comment-17057442 ] Matthias J. Sax commented on KAFKA-7965: [https://builds.apache.org/job/kafka-pr-jdk11-scala2.13/5122/testReport/junit/kafka.api/ConsumerBounceTest/testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup/] > Flaky Test > ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup > > > Key: KAFKA-7965 > URL: https://issues.apache.org/jira/browse/KAFKA-7965 > Project: Kafka > Issue Type: Bug > Components: clients, consumer, unit tests >Affects Versions: 1.1.1, 2.2.0, 2.3.0 >Reporter: Matthias J. Sax >Priority: Critical > Labels: flaky-test > Fix For: 2.3.0 > > > To get stable nightly builds for `2.2` release, I create tickets for all > observed test failures. > [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/21/] > {quote}java.lang.AssertionError: Received 0, expected at least 68 at > org.junit.Assert.fail(Assert.java:88) at > org.junit.Assert.assertTrue(Assert.java:41) at > kafka.api.ConsumerBounceTest.receiveAndCommit(ConsumerBounceTest.scala:557) > at > kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1(ConsumerBounceTest.scala:320) > at > kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1$adapted(ConsumerBounceTest.scala:319) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at > kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:319){quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9451) Pass consumer group metadata to producer on commit
[ https://issues.apache.org/jira/browse/KAFKA-9451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057441#comment-17057441 ] ASF GitHub Bot commented on KAFKA-9451: --- mjsax commented on pull request #8215: KAFKA-9451: Update MockConsumer to support ConsumerGroupMetadata URL: https://github.com/apache/kafka/pull/8215 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Pass consumer group metadata to producer on commit > -- > > Key: KAFKA-9451 > URL: https://issues.apache.org/jira/browse/KAFKA-9451 > Project: Kafka > Issue Type: Sub-task > Components: streams >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Major > > Using the producer-per-thread EOS design, we need to pass the consumer group > metadata into `producer.sendOffsetsToTransaction()` to use the new consumer > group coordinator fencing mechanism. We should also reduce the default > transaction timeout to 10 seconds (compare the KIP for details). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9708) Connector does not prefer to use packaged classes during configuration
[ https://issues.apache.org/jira/browse/KAFKA-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057438#comment-17057438 ] ASF GitHub Bot commented on KAFKA-9708: --- gharris1727 commented on pull request #8281: KAFKA-9708: Use PluginClassLoader during connector startup URL: https://github.com/apache/kafka/pull/8281 * Use classloading prioritization from Worker::startTask in Worker::startConnector Signed-off-by: Greg Harris *Summary of testing strategy (including rationale) for the feature or bug fix. Unit and/or integration tests are expected for any behaviour change and system tests should be considered for larger changes.* ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Connector does not prefer to use packaged classes during configuration > -- > > Key: KAFKA-9708 > URL: https://issues.apache.org/jira/browse/KAFKA-9708 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > > In connector tasks, classes loaded during configuration are preferentially > loaded from the PluginClassLoader since KAFKA-8819 was implemented. This same > prioritization is not currently respected in the connector itself, where the > delegating classloader is used as the context classloader. This leads to the > possibility for different versions of converters to be loaded, or different > versions of dependencies to be found when executing code in the connector vs > task. 
> Worker::startConnector should be changed to follow the startTask / KAFKA-8819 > prioritization scheme, by activating the PluginClassLoader earlier. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9708) Connector does not prefer to use packaged classes during configuration
Greg Harris created KAFKA-9708: -- Summary: Connector does not prefer to use packaged classes during configuration Key: KAFKA-9708 URL: https://issues.apache.org/jira/browse/KAFKA-9708 Project: Kafka Issue Type: Bug Components: KafkaConnect Reporter: Greg Harris Assignee: Greg Harris In connector tasks, classes loaded during configuration are preferentially loaded from the PluginClassLoader since KAFKA-8819 was implemented. This same prioritization is not currently respected in the connector itself, where the delegating classloader is used as the context classloader. This leads to the possibility for different versions of converters to be loaded, or different versions of dependencies to be found when executing code in the connector vs task. Worker::startConnector should be changed to follow the startTask / KAFKA-8819 prioritization scheme, by activating the PluginClassLoader earlier. -- This message was sent by Atlassian Jira (v8.3.4#803005)
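The prioritization change described above amounts to activating the plugin's classloader as the thread context classloader for the duration of connector startup and restoring it afterwards. A minimal self-contained sketch of that swap-and-restore pattern (class and method names here are illustrative stand-ins, not Connect's real `Worker`/`PluginClassLoader` code):

```java
// Sketch of the classloader prioritization discussed in KAFKA-9708: run an
// action with the plugin's loader as the context classloader, restoring the
// previous (delegating) loader afterwards, even if the action fails.
public class ClassLoaderSwapDemo {
    /** Runs `action` with `pluginLoader` as the context classloader, then restores the old one. */
    static void withPluginClassLoader(ClassLoader pluginLoader, Runnable action) {
        Thread current = Thread.currentThread();
        ClassLoader saved = current.getContextClassLoader();  // typically the delegating loader
        current.setContextClassLoader(pluginLoader);
        try {
            action.run();  // connector configuration/startup would happen here
        } finally {
            current.setContextClassLoader(saved);  // always restore, even on failure
        }
    }

    public static void main(String[] args) {
        ClassLoader plugin = new ClassLoader(ClassLoaderSwapDemo.class.getClassLoader()) {};
        ClassLoader before = Thread.currentThread().getContextClassLoader();
        withPluginClassLoader(plugin, () -> {
            // inside the action, context-classloader lookups hit the plugin loader first
            if (Thread.currentThread().getContextClassLoader() != plugin)
                throw new AssertionError("plugin loader not active");
        });
        if (Thread.currentThread().getContextClassLoader() != before)
            throw new AssertionError("context loader not restored");
        System.out.println("swap-and-restore OK");
    }
}
```

With this shape, classes loaded during connector configuration resolve through the plugin loader first, matching the task-startup behavior from KAFKA-8819.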
[jira] [Commented] (KAFKA-9707) InsertField.Key transformation does not apply to tombstone records
[ https://issues.apache.org/jira/browse/KAFKA-9707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057429#comment-17057429 ] ASF GitHub Bot commented on KAFKA-9707: --- gharris1727 commented on pull request #8280: KAFKA-9707: Fix InsertField.Key not applying to tombstone events URL: https://github.com/apache/kafka/pull/8280 * Fix typo that hardcoded .value() instead of abstract operatingValue * Add test for Key transform that was previously not tested Signed-off-by: Greg Harris ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > InsertField.Key transformation does not apply to tombstone records > -- > > Key: KAFKA-9707 > URL: https://issues.apache.org/jira/browse/KAFKA-9707 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > > Reproduction steps: > # Configure an InsertField.Key transformation > # Pass a tombstone record (with non-null key, but null value) through the > transform > Expected behavior: > The key field is inserted, and the value remains null > Observed behavior: > The key field is not inserted, and the value remains null -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9707) InsertField.Key transformation does not apply to tombstone records
Greg Harris created KAFKA-9707: -- Summary: InsertField.Key transformation does not apply to tombstone records Key: KAFKA-9707 URL: https://issues.apache.org/jira/browse/KAFKA-9707 Project: Kafka Issue Type: Bug Components: KafkaConnect Reporter: Greg Harris Assignee: Greg Harris Reproduction steps: # Configure an InsertField.Key transformation # Pass a tombstone record (with non-null key, but null value) through the transform Expected behavior: The key field is inserted, and the value remains null Observed behavior: The key field is not inserted, and the value remains null -- This message was sent by Atlassian Jira (v8.3.4#803005)
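The fix above is a one-line typo repair: shared transform logic is supposed to go through an abstract hook that resolves to the key or the value depending on the variant, and hardcoding `.value()` makes the Key variant wrongly skip tombstones. A self-contained sketch of that bug class (SimpleRecord and the class names are illustrative stand-ins, not Connect's real classes):

```java
// Sketch of the bug fixed in KAFKA-9707: Key/Value transform variants share
// logic through an abstract operatingValue() hook. The typo hardcoded the
// equivalent of r.value() in the shared null check, so InsertField.Key skipped
// tombstone records (non-null key, null value).
public class InsertFieldSketch {
    static final class SimpleRecord {
        final Object key, value;
        SimpleRecord(Object key, Object value) { this.key = key; this.value = value; }
        Object key() { return key; }
        Object value() { return value; }
    }

    abstract static class InsertField {
        abstract Object operatingValue(SimpleRecord r);  // key or value, per subclass

        boolean applies(SimpleRecord r) {
            // Correct: consult the abstract hook. The bug was checking
            // r.value() here, which is wrong for the Key variant on tombstones.
            return operatingValue(r) != null;
        }
    }

    static final class Key extends InsertField {
        Object operatingValue(SimpleRecord r) { return r.key(); }
    }

    static final class Value extends InsertField {
        Object operatingValue(SimpleRecord r) { return r.value(); }
    }

    public static void main(String[] args) {
        SimpleRecord tombstone = new SimpleRecord("k1", null);  // tombstone: key set, value null
        System.out.println(new Key().applies(tombstone));    // true: key transform still applies
        System.out.println(new Value().applies(tombstone));  // false: no value to operate on
    }
}
```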
[jira] [Commented] (KAFKA-9706) Flatten transformation fails when encountering tombstone event
[ https://issues.apache.org/jira/browse/KAFKA-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057408#comment-17057408 ] ASF GitHub Bot commented on KAFKA-9706: --- gharris1727 commented on pull request #8279: KAFKA-9706: Handle null keys/values in Flatten transformation URL: https://github.com/apache/kafka/pull/8279 * Fix DataException thrown when handling tombstone events with null value * Passes through original record when finding a tombstone record * Add tests for schema and schemaless data Signed-off-by: Greg Harris ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Flatten transformation fails when encountering tombstone event > -- > > Key: KAFKA-9706 > URL: https://issues.apache.org/jira/browse/KAFKA-9706 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > > When applying the {{Flatten}} transformation to a tombstone event, an > exception is raised: > {code:java} > org.apache.kafka.connect.errors.DataException: Only Map objects supported in > absence of schema for [flattening], found: null > {code} > Instead, the transform should pass the tombstone through the transform > without throwing an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9706) Flatten transformation fails when encountering tombstone event
Greg Harris created KAFKA-9706: -- Summary: Flatten transformation fails when encountering tombstone event Key: KAFKA-9706 URL: https://issues.apache.org/jira/browse/KAFKA-9706 Project: Kafka Issue Type: Bug Components: KafkaConnect Reporter: Greg Harris Assignee: Greg Harris When applying the {{Flatten}} transformation to a tombstone event, an exception is raised: {code:java} org.apache.kafka.connect.errors.DataException: Only Map objects supported in absence of schema for [flattening], found: null {code} Instead, the transform should pass the tombstone through the transform without throwing an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
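The expected behavior above — pass a tombstone through unchanged instead of raising `DataException` — can be sketched with a null guard at the top of the schemaless apply path. This is a simplified stand-in for Connect's Flatten transform, not its actual code:

```java
// Sketch of the KAFKA-9706 fix for the schemaless path: a null value (a
// tombstone) is returned unchanged rather than treated as a flattening error.
import java.util.LinkedHashMap;
import java.util.Map;

public class FlattenTombstoneSketch {
    static Map<String, Object> apply(Map<String, Object> value) {
        if (value == null) {
            return null;  // tombstone: pass through untouched, do not throw
        }
        Map<String, Object> flat = new LinkedHashMap<>();
        flatten("", value, flat);
        return flat;
    }

    @SuppressWarnings("unchecked")
    static void flatten(String prefix, Map<String, Object> map, Map<String, Object> out) {
        for (Map.Entry<String, Object> e : map.entrySet()) {
            String name = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object v = e.getValue();
            if (v instanceof Map) {
                flatten(name, (Map<String, Object>) v, out);  // recurse into nested maps
            } else {
                out.put(name, v);  // null leaf values are kept, not an error
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(apply(null));  // null (tombstone passed through)
        Map<String, Object> nested = Map.of("A", Map.of("C", "cValue"));
        System.out.println(apply(nested));  // {A.C=cValue}
    }
}
```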
[jira] [Commented] (KAFKA-1231) Support partition shrink (delete partition)
[ https://issues.apache.org/jira/browse/KAFKA-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057392#comment-17057392 ] Sönke Liebau commented on KAFKA-1231: - To be honest I think this caters to a bit of a fringe case and would create more issues than it could potentially solve - and since this issue has been open and uncommented for going on 6 years, anybody opposed to closing it? > Support partition shrink (delete partition) > --- > > Key: KAFKA-1231 > URL: https://issues.apache.org/jira/browse/KAFKA-1231 > Project: Kafka > Issue Type: Wish > Components: core >Affects Versions: 0.8.0 >Reporter: Marc Labbe >Priority: Minor > > Topics may have to be sized down once its peak usage has been passed. It > would be interesting to be able to reduce the number of partitions for a > topic. > In reference to http://search-hadoop.com/m/4TaT4hQiGt1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-1518) KafkaMetricsReporter prevents Kafka from starting if the custom reporter throws an exception
[ https://issues.apache.org/jira/browse/KAFKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057389#comment-17057389 ] Sönke Liebau commented on KAFKA-1518: - Personally I'd say, that this behavior is fine. If you don't want Kafka to stop then the reporter should catch and ignore the exception. But if an exception is propagated up to Kafka, then it shouldn't just be ignored. > KafkaMetricsReporter prevents Kafka from starting if the custom reporter > throws an exception > > > Key: KAFKA-1518 > URL: https://issues.apache.org/jira/browse/KAFKA-1518 > Project: Kafka > Issue Type: Bug > Components: tools >Affects Versions: 0.8.1.1 >Reporter: Daniel Compton >Priority: Major > > When Kafka starts up, it > [starts|https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/metrics/KafkaMetricsReporter.scala#L60] > custom metrics reporters. If any of these throw an exception on startup then > this will block Kafka from starting. > For example using > [kafka-riemann-reporter|https://github.com/pingles/kafka-riemann-reporter], > if Riemann is not available then it will throw an [IO > Exception|https://github.com/pingles/kafka-riemann-reporter/issues/1] which > isn't caught by KafkaMetricsReporter. This means that Kafka will fail to > start until it can connect to Riemann, coupling a HA system to a non HA > system. > It would probably be better to log the error and perhaps have a callback hook > to the reporter where they can handle it and startup in the background. -- This message was sent by Atlassian Jira (v8.3.4#803005)
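The mitigation the reporter suggests — log the error and start up anyway — amounts to wrapping each reporter's initialization in its own try/catch. A minimal sketch under that assumption (the `MetricsReporter` interface here is a simplified stand-in for Kafka's):

```java
// Sketch of the KAFKA-1518 mitigation: one failing custom metrics reporter
// (e.g. Riemann unreachable) is logged and skipped instead of propagating an
// exception that blocks broker startup.
import java.util.ArrayList;
import java.util.List;

public class ReporterStartupSketch {
    interface MetricsReporter { void init(); }

    /** Returns the reporters that started successfully; failures are logged and skipped. */
    static List<MetricsReporter> startReporters(List<MetricsReporter> reporters) {
        List<MetricsReporter> started = new ArrayList<>();
        for (MetricsReporter r : reporters) {
            try {
                r.init();
                started.add(r);
            } catch (Exception e) {
                // log and continue instead of propagating: broker startup is not blocked
                System.err.println("Metrics reporter failed to start, skipping: " + e.getMessage());
            }
        }
        return started;
    }

    public static void main(String[] args) {
        MetricsReporter ok = () -> {};
        MetricsReporter broken = () -> { throw new RuntimeException("cannot connect to Riemann"); };
        List<MetricsReporter> started = startReporters(List.of(broken, ok));
        System.out.println(started.size());  // 1: startup survived the broken reporter
    }
}
```

The counter-argument in the comment above is also visible in this shape: swallowing the exception here is exactly what a reporter could do for itself, which is why the commenter considers the current fail-fast behavior acceptable.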
[jira] [Commented] (KAFKA-1440) Per-request tracing
[ https://issues.apache.org/jira/browse/KAFKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057379#comment-17057379 ] Sönke Liebau commented on KAFKA-1440: - Looking through some old tickets and came across this. While certainly useful in principle, this seems like a fairly large piece of work by now. Can we close this, or is this something that someone feels strongly about and wants to look at? > Per-request tracing > --- > > Key: KAFKA-1440 > URL: https://issues.apache.org/jira/browse/KAFKA-1440 > Project: Kafka > Issue Type: Improvement > Components: clients >Reporter: Todd Palino >Priority: Major > > Could we have a flag in requests (potentially in the header for all requests, > but at least for produce and fetch) that would enable tracing for that > request. Currently, if we want to debug an issue with a request, we need to > turn on trace level logging for the entire broker. If the client could ask > for the request to be traced, we could then log detailed information for just > that request. > Ideally, this would be available as a flag that can be enabled in the client > via JMX, so it could be done without needing to restart the client > application. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-5972) Flatten SMT does not work with null values
[ https://issues.apache.org/jira/browse/KAFKA-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-5972: --- Fix Version/s: 2.2.2 2.4.0 2.3.1 2.1.2 2.0.2 1.1.2 1.0.3 > Flatten SMT does not work with null values > -- > > Key: KAFKA-5972 > URL: https://issues.apache.org/jira/browse/KAFKA-5972 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Tomas Zuklys >Assignee: siva santhalingam >Priority: Minor > Labels: easyfix, patch > Fix For: 1.0.3, 1.1.2, 2.0.2, 2.1.2, 2.2.2, 2.4.0, 2.3.1 > > Attachments: kafka-transforms.patch > > > Hi, > I noticed a bug in Flatten SMT while doing tests with different SMTs that are > provided out of the box. > Flatten SMT does not work as expected with schemaless JSON that has > properties with null values. > Example json: > {code} > {A={D=dValue, B=null, C=cValue}} > {code} > The issue is in the if statement that checks for null values. > Current version: > {code} > for (Map.Entry<String, Object> entry : originalRecord.entrySet()) { > final String fieldName = fieldName(fieldNamePrefix, > entry.getKey()); > Object value = entry.getValue(); > if (value == null) { > newRecord.put(fieldName(fieldNamePrefix, entry.getKey()), > null); > return; > } > ... > {code} > should be > {code} > for (Map.Entry<String, Object> entry : originalRecord.entrySet()) { > final String fieldName = fieldName(fieldNamePrefix, > entry.getKey()); > Object value = entry.getValue(); > if (value == null) { > newRecord.put(fieldName(fieldNamePrefix, entry.getKey()), > null); > continue; > } > {code} > I have attached a patch containing the fix for this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-5972) Flatten SMT does not work with null values
[ https://issues.apache.org/jira/browse/KAFKA-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris resolved KAFKA-5972. Resolution: Fixed > Flatten SMT does not work with null values > -- > > Key: KAFKA-5972 > URL: https://issues.apache.org/jira/browse/KAFKA-5972 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Tomas Zuklys >Assignee: siva santhalingam >Priority: Minor > Labels: easyfix, patch > Fix For: 1.0.3, 1.1.2, 2.0.2, 2.1.2, 2.3.1, 2.4.0, 2.2.2 > > Attachments: kafka-transforms.patch > > > Hi, > I noticed a bug in Flatten SMT while doing tests with different SMTs that are > provided out-of-box. > Flatten SMT does not work as expected with schemaless JSON that has > properties with null values. > Example json: > {code} > {A={D=dValue, B=null, C=cValue}} > {code} > The issue is in if statement that checks for null value. > Current version: > {code} > for (Map.Entry entry : originalRecord.entrySet()) { > final String fieldName = fieldName(fieldNamePrefix, > entry.getKey()); > Object value = entry.getValue(); > if (value == null) { > newRecord.put(fieldName(fieldNamePrefix, entry.getKey()), > null); > return; > } > ... > {code} > should be > {code} > for (Map.Entry entry : originalRecord.entrySet()) { > final String fieldName = fieldName(fieldNamePrefix, > entry.getKey()); > Object value = entry.getValue(); > if (value == null) { > newRecord.put(fieldName(fieldNamePrefix, entry.getKey()), > null); > continue; > } > {code} > I have attached a patch containing the fix for this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
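The one-word `return`-to-`continue` fix quoted in the report above has a behavioral difference worth seeing run: with `return`, the first null value aborts the whole loop and silently drops every remaining entry; with `continue`, the null is recorded and flattening proceeds. A runnable demonstration using the ticket's example map `{A={D=dValue, B=null, C=cValue}}` (simplified to one level; not Connect's actual Flatten code):

```java
// Demonstrates the KAFKA-5972 bug: `return` inside the entry loop drops all
// entries after the first null value, while `continue` preserves them.
import java.util.LinkedHashMap;
import java.util.Map;

public class ReturnVsContinue {
    static Map<String, Object> flattenBuggy(Map<String, Object> original) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : original.entrySet()) {
            if (e.getValue() == null) { out.put(e.getKey(), null); return out; }  // bug: aborts loop
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    static Map<String, Object> flattenFixed(Map<String, Object> original) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : original.entrySet()) {
            if (e.getValue() == null) { out.put(e.getKey(), null); continue; }  // fix: keep going
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> a = new LinkedHashMap<>();
        a.put("D", "dValue"); a.put("B", null); a.put("C", "cValue");
        System.out.println(flattenBuggy(a).size()); // 2: C was silently dropped
        System.out.println(flattenFixed(a).size()); // 3: all entries preserved
    }
}
```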
[jira] [Commented] (KAFKA-1336) Create partition compaction analyzer
[ https://issues.apache.org/jira/browse/KAFKA-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057373#comment-17057373 ] Sönke Liebau commented on KAFKA-1336: - While I agree in principle that this information might be useful, the age and inactivity of this ticket suggests to me that this is not a priority and might be closed? > Create partition compaction analyzer > > > Key: KAFKA-1336 > URL: https://issues.apache.org/jira/browse/KAFKA-1336 > Project: Kafka > Issue Type: New Feature >Reporter: Jay Kreps >Priority: Major > > It would be nice to have a tool that given a topic and partition reads the > full data and outputs: > 1. The percentage of records that are duplicates > 2. The percentage of records that have been deleted -- This message was sent by Atlassian Jira (v8.3.4#803005)
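The two statistics the ticket asks for can be computed in one pass over a partition's records: a record is a "duplicate" if a later record carries the same key, and "deleted" if its value is null (a tombstone). A sketch over an in-memory stand-in for a partition — a real tool would consume the partition with a Kafka consumer instead of iterating a list:

```java
// Sketch of the analysis proposed in KAFKA-1336: percentage of records
// superseded by a later record with the same key, and percentage of tombstones.
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompactionAnalyzerSketch {
    /** Returns {duplicatePct, deletedPct} for the given (key, value) records, in offset order. */
    static double[] analyze(List<Map.Entry<String, String>> records) {
        Map<String, Integer> lastIndex = new HashMap<>();
        for (int i = 0; i < records.size(); i++) lastIndex.put(records.get(i).getKey(), i);
        int duplicates = 0, deleted = 0;
        for (int i = 0; i < records.size(); i++) {
            if (lastIndex.get(records.get(i).getKey()) != i) duplicates++;  // superseded later
            if (records.get(i).getValue() == null) deleted++;               // tombstone
        }
        double n = records.size();
        return new double[] {100.0 * duplicates / n, 100.0 * deleted / n};
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records = List.of(
                new SimpleEntry<>("k1", "v1"),
                new SimpleEntry<>("k2", "v2"),
                new SimpleEntry<>("k1", "v3"),   // makes the first k1 record a duplicate
                new SimpleEntry<>("k3", null));  // tombstone
        double[] pct = analyze(records);
        System.out.printf("duplicates=%.0f%% deleted=%.0f%%%n", pct[0], pct[1]);  // 25% and 25%
    }
}
```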
[jira] [Resolved] (KAFKA-1292) Command-line tools should print what they do as part of their usage command
[ https://issues.apache.org/jira/browse/KAFKA-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sönke Liebau resolved KAFKA-1292. - Resolution: Fixed This has been implemented by now, I do believe within the linked ticket. > Command-line tools should print what they do as part of their usage command > --- > > Key: KAFKA-1292 > URL: https://issues.apache.org/jira/browse/KAFKA-1292 > Project: Kafka > Issue Type: Improvement >Reporter: Jay Kreps >Priority: Major > Labels: usability > > It would be nice if you could get an explanation for each tool by running the > command with no arguments. Something like > bin/kafka-preferred-replica-election.sh is a little scary so it would be nice > to have it self-describe: > > bin/kafka-preferred-replica-election.sh > This command attempts to return leadership to the preferred replicas (if they > are alive) from whomever is currently the leader). > Option Description > > -- --- > > --alter Alter the configuration for the topic. > ... -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-1265) SBT and Gradle create jars without expected Maven files
[ https://issues.apache.org/jira/browse/KAFKA-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057365#comment-17057365 ] Sönke Liebau commented on KAFKA-1265: - While it still holds true that the jar itself doesn't contain these files, the appropriate pom is published to maven central by Gradle and serves this purpose to a large extent. Also, as this has been open and uncommented for more than 6 years now, the pain seems not to be too great here :) I'd suggest we can close this? > SBT and Gradle create jars without expected Maven files > --- > > Key: KAFKA-1265 > URL: https://issues.apache.org/jira/browse/KAFKA-1265 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8.0, 0.8.1 >Reporter: Clark Breyman >Priority: Minor > Labels: build > Original Estimate: 48h > Remaining Estimate: 48h > > The jar files produced and deployed to maven central do not embed the > expected Maven pom.xml and pom.properties files as would be expected by a > standard Maven-build artifact. This results in jars that do not self-document > their versions or dependencies. > For reference to the maven behavior, see addMavenDescriptor (defaults to > true): http://maven.apache.org/shared/maven-archiver/#archive > Worst case, these files would need to be generated by the build and included > in the jar file. Gradle doesn't seem to have an automatic mechanism for this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057352#comment-17057352 ] Raman Gupta edited comment on KAFKA-8803 at 3/11/20, 8:20 PM: -- Here are the logs from this morning, bracketing the problem by an hour or so on each side and including logs from the broker shutdowns and restarts. The first instance of `UNKNOWN_LEADER_EPOCH` is at 14:38:36. Our client-side streams were failing from 14:59 to 16:40; the first instance of the error is at 14:59:41. The error always seems to occur for the first time after the client restarts, so it's quite possible it's related to a streams shutdown process, although I don't think so because the UNKNOWN_LEADER_EPOCH happened earlier. It seems like whatever the issue was occurred on the broker side, but as long as the client side didn't restart, things were ok. Just a theory though. There is no IllegalStateException as seen by [~oleksii.boiko]. [^logs-20200311.txt.gz] [^logs-client-20200311.txt.gz] > Stream will not start due to TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Assignee: Sophie Blee-Goldman > Priority: Major > Attachments: logs-20200311.txt.gz, logs-client-20200311.txt.gz, > logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. 
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raman Gupta updated KAFKA-8803: --- Attachment: logs-client-20200311.txt.gz logs-20200311.txt.gz > Stream will not start due to TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Assignee: Sophie Blee-Goldman >Priority: Major > Attachments: logs-20200311.txt.gz, logs-client-20200311.txt.gz, > logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. 
> I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9693) Kafka latency spikes caused by log segment flush on roll
[ https://issues.apache.org/jira/browse/KAFKA-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057325#comment-17057325 ] Jun Rao commented on KAFKA-9693: [~paolomoriello]: By metadata, I was referring to the metadata in the file system (e.g., file length, timestamp, etc). Basically, I am just wondering what contention is causing the producer spike. The producer is writing to a different file from the one being flushed. So, there is no obvious data contention. > Kafka latency spikes caused by log segment flush on roll > > > Key: KAFKA-9693 > URL: https://issues.apache.org/jira/browse/KAFKA-9693 > Project: Kafka > Issue Type: Improvement > Components: core > Environment: OS: Amazon Linux 2 > Kafka version: 2.2.1 >Reporter: Paolo Moriello >Assignee: Paolo Moriello >Priority: Major > Labels: Performance, latency, performance > Attachments: image-2020-03-10-13-17-34-618.png, > image-2020-03-10-14-36-21-807.png, image-2020-03-10-15-00-23-020.png, > image-2020-03-10-15-00-54-204.png, latency_plot2.png > > > h1. Summary > When a log segment fills up, Kafka rolls over onto a new active segment and > force the flush of the old segment to disk. When this happens, log segment > _append_ duration increase causing important latency spikes on producer(s) > and replica(s). This ticket aims to highlight the problem and propose a > simple mitigation: add a new configuration to enable/disable rolled segment > flush. > h1. 1. Phenomenon > Response time of produce request (99th ~ 99.9th %ile) repeatedly spikes to > ~50x-200x more than usual. For instance, normally 99th %ile is lower than > 5ms, but when this issue occurs, it marks 100ms to 200ms. 99.9th and 99.99th > %iles even jump to 500-700ms. > Latency spikes happen at constant frequency (depending on the input > throughput), for small amounts of time. All the producers experience a > latency increase at the same time. > h1. !image-2020-03-10-13-17-34-618.png|width=942,height=314! 
> {{Example of response time plot observed during on a single producer.}} > URPs rarely appear in correspondence of the latency spikes too. This is > harder to reproduce, but from time to time it is possible to see a few > partitions going out of sync in correspondence of a spike. > h1. 2. Experiment > h2. 2.1 Setup > Kafka cluster hosted on AWS EC2 instances. > h4. Cluster > * 15 Kafka brokers: (EC2 m5.4xlarge) > ** Disk: 1100Gb EBS volumes (4750Mbps) > ** Network: 10 Gbps > ** CPU: 16 Intel Xeon Platinum 8000 > ** Memory: 64Gb > * 3 Zookeeper nodes: m5.large > * 6 producers on 6 EC2 instances in the same region > * 1 topic, 90 partitions - replication factor=3 > h4. Broker config > Relevant configurations: > {quote}num.io.threads=8 > num.replica.fetchers=2 > offsets.topic.replication.factor=3 > num.network.threads=5 > num.recovery.threads.per.data.dir=2 > min.insync.replicas=2 > num.partitions=1 > {quote} > h4. Perf Test > * Throughput ~6000-8000 (~40-70Mb/s input + replication = ~120-210Mb/s per > broker) > * record size = 2 > * Acks = 1, linger.ms = 1, compression.type = none > * Test duration: ~20/30min > h2. 2.2 Analysis > Our analysis showed an high +correlation between log segment flush count/rate > and the latency spikes+. This indicates that the spikes in max latency are > related to Kafka behavior on rolling over new segments. > The other metrics did not show any relevant impact on any hardware component > of the cluster, eg. cpu, memory, network traffic, disk throughput... > > !latency_plot2.png|width=924,height=308! > {{Correlation between latency spikes and log segment flush count. p50, p95, > p99, p999 and p latencies (left axis, ns) and the flush #count (right > axis, stepping blue line in plot).}} > Kafka schedules logs flushing (this includes flushing the file record > containing log entries, the offset index, the timestamp index and the > transaction index) during _roll_ operations. 
A log is rolled over onto a new > empty log when: > * the log segment is full > * the max time has elapsed since the timestamp of the first message in the > segment (or, in its absence, since the create time) > * the index is full > In this case, the increase in latency happens on _append_ of a new message > set to the active segment of the log. This is a synchronous operation which > therefore blocks producer requests, causing the latency increase. > To confirm this, I instrumented Kafka to measure the duration of > the FileRecords.append(MemoryRecords) method, which is responsible for writing > memory records to file. As a result, I observed the same spiky pattern as in > the producer latency, with a one-to-one correspondence with the append > duration. > !image-2020-0
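The mechanism described in this thread — a synchronous flush landing on the append path — can be observed in isolation with plain `java.nio`: time buffered appends to a file versus appends followed by an explicit `FileChannel.force`, the JVM-level analogue of the segment flush scheduled on roll. This is only an illustration of the contention, not Kafka's actual log code; absolute numbers vary by hardware and filesystem.

```java
// Standalone sketch: compares mean append latency with and without a
// synchronous flush (data + metadata) after each write, mirroring why a
// flush on the append path shows up as producer latency spikes.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FlushLatencySketch {
    static long timeAppends(FileChannel ch, boolean forceEach, int count) throws IOException {
        ByteBuffer payload = ByteBuffer.allocate(1024);
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            payload.rewind();
            ch.write(payload);                       // buffered append (page cache only)
            if (forceEach) ch.force(true);           // synchronous flush: data + file metadata
        }
        return (System.nanoTime() - start) / count;  // mean ns per append
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            long buffered = timeAppends(ch, false, 200);
            long forced = timeAppends(ch, true, 200);
            // the forced path is typically orders of magnitude slower
            System.out.println("buffered append ~" + buffered + " ns, forced ~" + forced + " ns");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```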
[jira] [Commented] (KAFKA-9693) Kafka latency spikes caused by log segment flush on roll
[ https://issues.apache.org/jira/browse/KAFKA-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057313#comment-17057313 ] Paolo Moriello commented on KAFKA-9693: --- [~junrao] thank you again for you replay. To clarify, when you say metadata, are you referring to the indexes or also something else? In this case, it is likely due to a file system data contention only. I tried to temporarily disable the data log flush and got the same result in terms of latency improvement. Moreover, I observed the same spikes in the fswrite of the data itself. > Kafka latency spikes caused by log segment flush on roll > > > Key: KAFKA-9693 > URL: https://issues.apache.org/jira/browse/KAFKA-9693 > Project: Kafka > Issue Type: Improvement > Components: core > Environment: OS: Amazon Linux 2 > Kafka version: 2.2.1 >Reporter: Paolo Moriello >Assignee: Paolo Moriello >Priority: Major > Labels: Performance, latency, performance > Attachments: image-2020-03-10-13-17-34-618.png, > image-2020-03-10-14-36-21-807.png, image-2020-03-10-15-00-23-020.png, > image-2020-03-10-15-00-54-204.png, latency_plot2.png > > > h1. Summary > When a log segment fills up, Kafka rolls over onto a new active segment and > force the flush of the old segment to disk. When this happens, log segment > _append_ duration increase causing important latency spikes on producer(s) > and replica(s). This ticket aims to highlight the problem and propose a > simple mitigation: add a new configuration to enable/disable rolled segment > flush. > h1. 1. Phenomenon > Response time of produce request (99th ~ 99.9th %ile) repeatedly spikes to > ~50x-200x more than usual. For instance, normally 99th %ile is lower than > 5ms, but when this issue occurs, it marks 100ms to 200ms. 99.9th and 99.99th > %iles even jump to 500-700ms. > Latency spikes happen at constant frequency (depending on the input > throughput), for small amounts of time. 
> All the producers experience a latency increase at the same time.
> !image-2020-03-10-13-17-34-618.png|width=942,height=314!
> {{Example of response time plot observed on a single producer.}}
> URPs also occasionally appear in correspondence with the latency spikes. This is harder to reproduce, but from time to time it is possible to see a few partitions going out of sync at the same moment as a spike.
> h1. 2. Experiment
> h2. 2.1 Setup
> Kafka cluster hosted on AWS EC2 instances.
> h4. Cluster
> * 15 Kafka brokers (EC2 m5.4xlarge)
> ** Disk: 1100Gb EBS volumes (4750Mbps)
> ** Network: 10 Gbps
> ** CPU: 16 Intel Xeon Platinum 8000
> ** Memory: 64Gb
> * 3 Zookeeper nodes: m5.large
> * 6 producers on 6 EC2 instances in the same region
> * 1 topic, 90 partitions, replication factor=3
> h4. Broker config
> Relevant configurations:
> {quote}num.io.threads=8
> num.replica.fetchers=2
> offsets.topic.replication.factor=3
> num.network.threads=5
> num.recovery.threads.per.data.dir=2
> min.insync.replicas=2
> num.partitions=1
> {quote}
> h4. Perf Test
> * Throughput ~6000-8000 (~40-70Mb/s input + replication = ~120-210Mb/s per broker)
> * record size = 2
> * Acks = 1, linger.ms = 1, compression.type = none
> * Test duration: ~20/30min
> h2. 2.2 Analysis
> Our analysis showed a high +correlation between log segment flush count/rate and the latency spikes+. This indicates that the spikes in max latency are related to Kafka's behavior on rolling over new segments.
> The other metrics did not show any relevant impact on any hardware component of the cluster, e.g. CPU, memory, network traffic, disk throughput...
>
> !latency_plot2.png|width=924,height=308!
> {{Correlation between latency spikes and log segment flush count.
> p50, p95, p99, p999 and p9999 latencies (left axis, ns) and the flush count (right axis, stepping blue line in plot).}}
> Kafka schedules log flushing (this includes flushing the file record containing log entries, the offset index, the timestamp index and the transaction index) during _roll_ operations. A log is rolled over onto a new empty log when:
> * the log segment is full
> * the max time has elapsed since the timestamp of the first message in the segment (or, in its absence, since the create time)
> * the index is full
> In this case, the increase in latency happens on _append_ of a new message set to the active segment of the log. This is a synchronous operation which therefore blocks producer requests, causing the latency increase.
> To confirm this, I instrumented Kafka to measure the duration of the FileRecords.append(MemoryRecords) method, which is responsible for writing memory records to file. As a result, I observed the same spiky pattern as in the producer
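The mechanism discussed above, a synchronous fsync blocking the append path, can be reproduced outside Kafka. The sketch below is illustrative only (it is not Kafka's actual instrumentation; the file name and 64 KiB batch size are arbitrary): it times a plain buffered write, which usually lands in the page cache, against `FileChannel.force(true)`, which blocks until the data is durable, the analogue of the flush scheduled on segment roll.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FlushLatencyDemo {
    /** Writes one 64 KiB batch, then forces it to disk; returns
     *  {appendNanos, flushNanos}. force(true) blocks until data and
     *  metadata reach the device, which is why a flush on the append
     *  path shows up as a producer latency spike. */
    public static long[] measure() throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ByteBuffer batch = ByteBuffer.allocate(64 * 1024);

            long t0 = System.nanoTime();
            ch.write(batch);                 // usually only hits the page cache
            long appendNanos = System.nanoTime() - t0;

            long t1 = System.nanoTime();
            ch.force(true);                  // blocks until durable on disk
            long flushNanos = System.nanoTime() - t1;

            return new long[] {appendNanos, flushNanos};
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        long[] d = measure();
        System.out.printf("append: %d us, force(fsync): %d us%n",
                d[0] / 1_000, d[1] / 1_000);
    }
}
```

On most systems the force() duration dominates the buffered write by orders of magnitude, which matches the correlation reported in the ticket.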
[jira] [Created] (KAFKA-9705) (Incremental)AlterConfig should be propagated from Controller in bridge release
Boyang Chen created KAFKA-9705:
--
Summary: (Incremental)AlterConfig should be propagated from Controller in bridge release
Key: KAFKA-9705
URL: https://issues.apache.org/jira/browse/KAFKA-9705
Project: Kafka
Issue Type: Sub-task
Reporter: Boyang Chen
Assignee: Boyang Chen

In the bridge release, we need to restrict direct ZK access to the controller only. This means the existing AlterConfig path should be migrated.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057296#comment-17057296 ] Guozhang Wang commented on KAFKA-8803: -- Raman, could you share the broker-side logs (everything: the server logs, controller logs, anything you have) so that we can help investigate? Did you see the IllegalStateException that Oleksii saw?
> Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
>
> Key: KAFKA-8803
> URL: https://issues.apache.org/jira/browse/KAFKA-8803
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Reporter: Raman Gupta
> Assignee: Sophie Blee-Goldman
> Priority: Major
> Attachments: logs.txt.gz, screenshot-1.png
>
> One streams app is consistently failing at startup with the following exception:
> {code}
> 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout exception caught when initializing transactions for task 0_36. This might happen if the broker is slow to respond, if the network connection to the broker was interrupted, or if similar circumstances arise. You can increase producer parameter `max.block.ms` to increase this timeout.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
> {code}
> These same brokers are used by many other streams without any issue, including some in the very same processes as the stream which consistently throws this exception.
> *UPDATE 08/16:*
> The very first instance of this error is August 13th 2019, 17:03:36.754 and it happened for 4 different streams. For 3 of these streams, the error only happened once, and then the stream recovered. For the 4th stream, the error has continued to happen, and continues to happen now.
> I looked up the broker logs for this time, and see that at August 13th 2019, 16:47:43, two of four brokers started reporting messages like this, for multiple partitions:
> [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
> The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped; here is a view of the count of these messages over time:
> !screenshot-1.png!
> However, as noted, the stream task timeout error continues to happen.
> I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 brokers. The broker has a patch for KAFKA-8773.
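The error message above suggests raising the producer's `max.block.ms` (the bound on how long blocking calls such as `send()` or `initTransactions()` may wait). A minimal sketch of that override, using plain `java.util.Properties`; the broker address is a placeholder and the 5-minute value is only an example, not a recommendation:

```java
import java.util.Properties;

public class ProducerTimeoutConfig {
    /** Builds producer properties with a raised max.block.ms.
     *  The default is 60000 ms, which matches the "Timeout expired after
     *  60000milliseconds" in the stack trace above. */
    public static Properties producerProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092"); // placeholder
        // Bound on how long send()/initTransactions() may block.
        props.setProperty("max.block.ms", "300000"); // e.g. raise to 5 minutes
        return props;
    }
}
```

As the later comments note, a longer timeout only helps when the underlying stall is transient; it cannot fix a wedged broker-side transaction.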
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057291#comment-17057291 ] Guozhang Wang commented on KAFKA-8803: -- [~rocketraman] Also, about your reported IllegalStateException: I just found that in later versions we have actually turned that illegal-state check into a warning: https://github.com/apache/kafka/pull/7840, as part of 2.5.0.
[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057285#comment-17057285 ] Guozhang Wang edited comment on KAFKA-8803 at 3/11/20, 6:09 PM: [~rocketraman] "Re: I don't think that is going to help. We currently let-it-crash in this situation, and retry on the next startup, and when this error happens, these retries do not work until the brokers are restarted." I guess my previous reply was a bit misleading. What I meant is that, from our investigation, some initPID timeouts are actually transient: they just take longer than 5 minutes, because the partition is unavailable, so the abort record cannot be written and replicated, and hence the transition to abort-prepare cannot be completed. In such cases retries would help. In other cases, we also observe that the timeout is permanent (this is, I believe, what you've encountered), and that we believe is a broker-side issue that we are still investigating, as I tried to explain in 2). was (Author: guozhang): [~rocketraman] I guess my previous reply was a bit misleading. What I meant is that, from our investigation, some initPID timeouts are actually transient: they just take longer than 5 minutes, because the partition is unavailable, so the abort record cannot be written and replicated, and hence the transition to abort-prepare cannot be completed. In such cases retries would help. In other cases, we also observe that the timeout is permanent (this is, I believe, what you've encountered), and that we believe is a broker-side issue that we are still investigating, as I tried to explain in 2).
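The transient-versus-permanent distinction above is why client-side retries only sometimes help. A generic sketch of a bounded retry with doubling backoff (a hypothetical helper, not Kafka client code; `withRetries` and its parameters are made up for illustration):

```java
import java.util.concurrent.Callable;

public class BoundedRetry {
    /** Retries the task up to maxAttempts, doubling the pause between tries.
     *  A transient failure succeeds on a later attempt; a permanent
     *  broker-side failure exhausts the budget and rethrows the last error. */
    public static <T> T withRetries(Callable<T> task, int maxAttempts,
                                    long initialBackoffMs) throws Exception {
        Exception last = null;
        long backoff = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff); // back off before the next attempt
                    backoff *= 2;
                }
            }
        }
        throw last; // budget exhausted: surface the final failure
    }
}
```

Against a wedged broker-side transaction, as Raman reports below, every attempt fails identically, so this loop only delays the crash.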
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057285#comment-17057285 ] Guozhang Wang commented on KAFKA-8803: -- [~rocketraman] "and due to time shift which may indeed be smaller if timer goes backwards. We should have a follow-up PR as in https://github.com/apache/kafka/pull/3286 to release this check. I will send a short PR shortly." I guess my previous reply was a bit misleading. What I meant is that, from our investigation, some initPID timeouts are actually transient: they just take longer than 5 minutes, because the partition is unavailable, so the abort record cannot be written and replicated, and hence the transition to abort-prepare cannot be completed. In such cases retries would help. In other cases, we also observe that the timeout is permanent (this is, I believe, what you've encountered), and that we believe is a broker-side issue that we are still investigating, as I tried to explain in 2).
[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057285#comment-17057285 ] Guozhang Wang edited comment on KAFKA-8803 at 3/11/20, 6:08 PM: [~rocketraman] I guess my previous reply was a bit misleading. What I meant is that, from our investigation, some initPID timeouts are actually transient: they just take longer than 5 minutes, because the partition is unavailable, so the abort record cannot be written and replicated, and hence the transition to abort-prepare cannot be completed. In such cases retries would help. In other cases, we also observe that the timeout is permanent (this is, I believe, what you've encountered), and that we believe is a broker-side issue that we are still investigating, as I tried to explain in 2). was (Author: guozhang): [~rocketraman] "and due to time shift which may indeed be smaller if timer goes backwards. We should have a follow-up PR as in https://github.com/apache/kafka/pull/3286 to release this check. I will send a short PR shortly." I guess my previous reply was a bit misleading. What I meant is that, from our investigation, some initPID timeouts are actually transient: they just take longer than 5 minutes, because the partition is unavailable, so the abort record cannot be written and replicated, and hence the transition to abort-prepare cannot be completed. In such cases retries would help. In other cases, we also observe that the timeout is permanent (this is, I believe, what you've encountered), and that we believe is a broker-side issue that we are still investigating, as I tried to explain in 2).
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057281#comment-17057281 ] Raman Gupta commented on KAFKA-8803: [~guozhang] FYI we had this happen again today, and this time I did NOT see any errors similar to "The metadata cache for txn partition 22 has already exist with epoch 567 and 9 entries while trying to add to it; this should not happen", so it may or may not be related (probably not).
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057279#comment-17057279 ] Guozhang Wang commented on KAFKA-8803: -- https://github.com/apache/kafka/pull/8269
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057276#comment-17057276 ] ASF GitHub Bot commented on KAFKA-8803: --- guozhangwang commented on pull request #8278: KAFKA-8803: Remove timestamp check in completeTransitionTo URL: https://github.com/apache/kafka/pull/8278 In `prepareAddPartitions` the txnStartTimestamp can be updated to updateTimestamp, which is assumed to always be larger than the original startTimestamp. However, due to NTP time shift the timer may go backwards, and hence the new start timestamp may be smaller than the original one. Later, in `completeTransitionTo`, the time check then fails with an IllegalStateException, and the txn does not transition to Ongoing. An indirect result of this is that the txn would NEVER be expired anymore, because only Ongoing txns are checked for expiration. We should do the same as in https://github.com/apache/kafka/pull/3286 and remove this check.
### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
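The failure mode the PR describes reduces to a monotonicity check defeated by a non-monotonic clock. The toy state holder below is illustrative only (the class and method names merely echo the PR description; this is not Kafka's TransactionMetadata): it throws exactly when a new start timestamp precedes the stored one, which is what happens after a backwards NTP step and which the PR removes.

```java
public class TxnTimestampCheck {
    private long txnStartTimestamp;

    public TxnTimestampCheck(long initialTimestamp) {
        this.txnStartTimestamp = initialTimestamp;
    }

    /** Mimics the transition check: rejects a start timestamp that moved
     *  backwards. With wall-clock time (e.g. after an NTP step adjustment)
     *  this can throw even though nothing is actually wrong, leaving the
     *  transaction stuck outside the Ongoing state. */
    public void completeTransitionTo(long newStartTimestamp) {
        if (newStartTimestamp < txnStartTimestamp) {
            throw new IllegalStateException("txn start timestamp went backwards");
        }
        this.txnStartTimestamp = newStartTimestamp;
    }
}
```

Since only Ongoing transactions are scanned for expiration, a transaction wedged this way is never cleaned up, which is consistent with the permanent InitProducerId timeouts reported in this ticket.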
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057269#comment-17057269 ] Guozhang Wang commented on KAFKA-8803: -- [~oleksii.boiko] indeed had a great observation about one root cause of it: in `prepareAddPartitions` the txnStartTimestamp is updated as updateTimestamp, and due to time shift which may indeed be smaller if timer goes backwards. We should have a follow-up PR as in https://github.com/apache/kafka/pull/3286 to release this check. I will send a short PR shortly. [~rocketraman] I'm still investigating your observation of "The metadata cache for txn partition 22 has already exist with epoch 567 and 9 entries while trying to add to it; this should not happen", stay tuned. cc [~hachikuji] > Stream will not start due to TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Assignee: Sophie Blee-Goldman >Priority: Major > Attachments: logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. 
> *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
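The clock-shift hazard Guozhang describes (a start timestamp recorded against an update timestamp that the wall clock has since moved behind) can be sketched in isolation. This is illustrative only: `TxnTimestamps` and `clampedStartTimestamp` are hypothetical names, not Kafka's actual coordinator code, and the real fix relaxes a validation check rather than clamping.

```java
// Illustrative sketch only: shows the invariant a backwards-moving clock can
// violate. The hypothetical helper clamps the recorded start timestamp so it
// can never exceed the last-update timestamp.
final class TxnTimestamps {

    static long clampedStartTimestamp(long startTimestamp, long updateTimestamp) {
        // If the wall clock shifted backwards between the two readings,
        // updateTimestamp < startTimestamp; keep the smaller value.
        return Math.min(startTimestamp, updateTimestamp);
    }

    public static void main(String[] args) {
        // Clock moved backwards between start and update: clamp to the update time.
        System.out.println(clampedStartTimestamp(1_000L, 900L));   // 900
        // Normal case: start precedes update, keep it.
        System.out.println(clampedStartTimestamp(1_000L, 1_500L)); // 1000
    }
}
```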
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057262#comment-17057262 ] Raman Gupta commented on KAFKA-8803: Thanks [~guozhang]. I'm a little confused by your response. Regarding the timeout and timeout retry: as mentioned previously in this issue, I don't think that is going to help. We currently let-it-crash in this situation and retry on the next startup, and when this error happens, these retries do not work until the brokers are restarted. Whatever state the brokers are in is not fixed by retries nor by longer timeouts.
[jira] [Commented] (KAFKA-9693) Kafka latency spikes caused by log segment flush on roll
[ https://issues.apache.org/jira/browse/KAFKA-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057256#comment-17057256 ] Jun Rao commented on KAFKA-9693: [~paolomoriello]: Thanks for the reply. Do you know if the increased latency is due to contention on the data or the metadata in the file system? If it's the data, it seems that by delaying the flush long enough, the producer would have moved to a different block from the one being flushed. If it's the metadata, we flush other checkpoint files periodically, and I am wondering if they cause producer latency spikes too.
> Kafka latency spikes caused by log segment flush on roll
>
> Key: KAFKA-9693
> URL: https://issues.apache.org/jira/browse/KAFKA-9693
> Project: Kafka
> Issue Type: Improvement
> Components: core
> Environment: OS: Amazon Linux 2
> Kafka version: 2.2.1
> Reporter: Paolo Moriello
> Assignee: Paolo Moriello
> Priority: Major
> Labels: Performance, latency, performance
> Attachments: image-2020-03-10-13-17-34-618.png,
> image-2020-03-10-14-36-21-807.png, image-2020-03-10-15-00-23-020.png,
> image-2020-03-10-15-00-54-204.png, latency_plot2.png
>
> h1. Summary
> When a log segment fills up, Kafka rolls over onto a new active segment and
> forces the flush of the old segment to disk. When this happens, log segment
> _append_ duration increases, causing significant latency spikes on producer(s)
> and replica(s). This ticket aims to highlight the problem and propose a
> simple mitigation: add a new configuration to enable/disable rolled segment
> flush.
> h1. 1. Phenomenon
> Response time of produce requests (99th ~ 99.9th %ile) repeatedly spikes to
> ~50x-200x more than usual. For instance, normally the 99th %ile is lower than
> 5ms, but when this issue occurs, it marks 100ms to 200ms. The 99.9th and
> 99.99th %iles even jump to 500-700ms.
> Latency spikes happen at a constant frequency (depending on the input
> throughput), for small amounts of time. All the producers experience a
> latency increase at the same time.
> h1. !image-2020-03-10-13-17-34-618.png|width=942,height=314!
> {{Example of response time plot observed on a single producer.}}
> URPs rarely appear in correspondence with the latency spikes too. This is
> harder to reproduce, but from time to time it is possible to see a few
> partitions going out of sync in correspondence with a spike.
> h1. 2. Experiment
> h2. 2.1 Setup
> Kafka cluster hosted on AWS EC2 instances.
> h4. Cluster
> * 15 Kafka brokers: (EC2 m5.4xlarge)
> ** Disk: 1100Gb EBS volumes (4750Mbps)
> ** Network: 10 Gbps
> ** CPU: 16 Intel Xeon Platinum 8000
> ** Memory: 64Gb
> * 3 Zookeeper nodes: m5.large
> * 6 producers on 6 EC2 instances in the same region
> * 1 topic, 90 partitions - replication factor=3
> h4. Broker config
> Relevant configurations:
> {quote}num.io.threads=8
> num.replica.fetchers=2
> offsets.topic.replication.factor=3
> num.network.threads=5
> num.recovery.threads.per.data.dir=2
> min.insync.replicas=2
> num.partitions=1
> {quote}
> h4. Perf Test
> * Throughput ~6000-8000 (~40-70Mb/s input + replication = ~120-210Mb/s per
> broker)
> * record size = 2
> * Acks = 1, linger.ms = 1, compression.type = none
> * Test duration: ~20/30min
> h2. 2.2 Analysis
> Our analysis showed a high +correlation between log segment flush count/rate
> and the latency spikes+. This indicates that the spikes in max latency are
> related to Kafka's behavior on rolling over new segments.
> The other metrics did not show any relevant impact on any hardware component
> of the cluster, e.g. cpu, memory, network traffic, disk throughput...
>
> !latency_plot2.png|width=924,height=308!
> {{Correlation between latency spikes and log segment flush count. p50, p95,
> p99, p999 and p latencies (left axis, ns) and the flush #count (right
> axis, stepping blue line in plot).}}
> Kafka schedules log flushing (this includes flushing the file record
> containing log entries, the offset index, the timestamp index and the
> transaction index) during _roll_ operations. A log is rolled over onto a new
> empty log when:
> * the log segment is full
> * the max time has elapsed since the timestamp of the first message in the
> segment (or, in its absence, since the create time)
> * the index is full
> In this case, the increase in latency happens on _append_ of a new message
> set to the active segment of the log. This is a synchronous operation which
> therefore blocks producer requests, causing the latency increase.
> To confirm this, I instrumented Kafka to measure the duration of the
> FileRecords.append(MemoryRecords) method, which is responsible for writing
> memory records to file. As a result, I observed the same spiky pattern as
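The instrumentation step described above (timing `FileRecords.append(MemoryRecords)`) can be approximated outside Kafka with a plain `FileChannel`. The helper below is a hedged stand-in for that measurement, not the actual patch; `AppendTimer` and `timedAppend` are invented names.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hedged sketch of the instrumentation idea: wrap an append to a segment file
// in a System.nanoTime() measurement to observe per-append latency spikes.
final class AppendTimer {

    static long timedAppend(FileChannel channel, ByteBuffer records) throws IOException {
        long start = System.nanoTime();
        while (records.hasRemaining()) {
            channel.write(records); // synchronous append, as on Kafka's hot path
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path segment = Files.createTempFile("segment", ".log");
        try (FileChannel ch = FileChannel.open(segment,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            long nanos = timedAppend(ch, ByteBuffer.wrap(new byte[2048]));
            System.out.println("append took " + nanos + " ns");
        }
        Files.deleteIfExists(segment);
    }
}
```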
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057241#comment-17057241 ] Guozhang Wang commented on KAFKA-8803: -- [~rocketraman] Yes, here is the current status:
1) On the streams side, I've merged a PR, https://github.com/apache/kafka/pull/8060, which handles the timeout exception gracefully and retries. It is merged in trunk and hence will be in the 2.6.0 release (you can also try it out by compiling from trunk).
2) The root cause of this long init-pid timeout, as far as we've investigated so far, is broker-side partition availability: the txn record cannot be successfully written (note that since the txn record is critical for determining the state of the txn, brokers require acks=all with min.isr for this append), and hence the current dangling txn cannot be aborted and thus the new PID cannot be granted. We're still working on the broker-side improvement to reduce the likelihood of the scenario that may cause this timeout.
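The graceful-retry behaviour referenced for the streams side (apache/kafka PR 8060) can be illustrated with a generic, standalone helper. The real change lives inside Kafka Streams' task initialization; the names below and the use of `java.util.concurrent.TimeoutException` (rather than Kafka's own unchecked `TimeoutException`) are assumptions of the sketch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Sketch of the pattern: retry an initialization step a bounded number of
// times when it fails with a timeout, instead of letting the thread die on
// the first failure.
final class InitRetry {

    static <T> T withRetries(Callable<T> init, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return init.call();
            } catch (TimeoutException e) {
                last = e; // e.g. broker slow to grant the producer id; try again
            }
        }
        throw last; // retry budget exhausted: surface the last timeout
    }
}
```

With a step that times out twice and then succeeds, `withRetries(step, 5)` returns the result on the third attempt rather than crashing.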
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057191#comment-17057191 ] Raman Gupta commented on KAFKA-8803: Any progress on this? Still happening regularly for us.
[jira] [Commented] (KAFKA-9693) Kafka latency spikes caused by log segment flush on roll
[ https://issues.apache.org/jira/browse/KAFKA-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056893#comment-17056893 ] Paolo Moriello commented on KAFKA-9693: --- [~junrao] Thanks for the interest! To answer your questions: (1) the filesystem used in the test is ext4 with jbd2. (2) yes, we could get the same benefit by delaying the flush; however, I could see the same effect (latency spikes mitigated) only with a delay >= 30s (because of the default kernel page cache configuration). I guess we could expose a configuration to control this delay, instead of the on/off only as originally proposed.
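A configurable flush delay, as floated in the comment above, could look roughly like the following. Everything here is invented for illustration (`DelayedFlush` is not a Kafka class and no such config exists): the point is only that the rolled segment's flush moves off the producer-facing append path onto a scheduler.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of the mitigation discussed: instead of flushing the old segment
// synchronously on roll, schedule the flush after a configurable delay
// (e.g. >= 30s, by which time the kernel has typically written the dirty
// pages back already, making the explicit flush cheap).
final class DelayedFlush {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final long delayMs;

    DelayedFlush(long delayMs) { this.delayMs = delayMs; }

    ScheduledFuture<?> onSegmentRolled(Runnable flushOldSegment) {
        // The append path returns immediately; the flush runs later.
        return scheduler.schedule(flushOldSegment, delayMs, TimeUnit.MILLISECONDS);
    }

    void close() { scheduler.shutdown(); }
}
```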
[jira] [Comment Edited] (KAFKA-9458) Kafka crashed in windows environment
[ https://issues.apache.org/jira/browse/KAFKA-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021999#comment-17021999 ] hirik edited comment on KAFKA-9458 at 3/11/20, 10:13 AM: - [~manme...@gmail.com], I created a git pull request. Can you please review the changes? [https://github.com/apache/kafka/pull/8276] was (Author: hirik): [~manme...@gmail.com], I created a git pull request. Can you please review the changes? [https://github.com/apache/kafka/pull/8276] [|https://github.com/apache/kafka/pull/8002]
> Kafka crashed in windows environment
>
> Key: KAFKA-9458
> URL: https://issues.apache.org/jira/browse/KAFKA-9458
> Project: Kafka
> Issue Type: Bug
> Components: log
> Affects Versions: 2.4.0
> Environment: Windows Server 2019
> Reporter: hirik
> Priority: Critical
> Labels: windows
> Fix For: 2.6.0
> Attachments: logs.zip
>
> Hi,
> while I was trying to validate the Kafka retention policy, the Kafka server
> crashed with the below exception trace.
> [2020-01-21 17:10:40,475] INFO [Log partition=test1-3,
> dir=C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka]
> Rolled new log segment at offset 1 in 52 ms. (kafka.log.Log)
> [2020-01-21 17:10:40,484] ERROR Error while deleting segments for test1-3 in
> dir C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka
> (kafka.server.LogDirFailureChannel)
> java.nio.file.FileSystemException:
> C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex
> ->
> C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex.deleted:
> The process cannot access the file because it is being used by another
> process.
> at > java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92) > at > java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103) > at java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:395) > at > java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:292) > at java.base/java.nio.file.Files.move(Files.java:1425) > at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:795) > at kafka.log.AbstractIndex.renameTo(AbstractIndex.scala:209) > at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:497) > at kafka.log.Log.$anonfun$deleteSegmentFiles$1(Log.scala:2206) > at kafka.log.Log.$anonfun$deleteSegmentFiles$1$adapted(Log.scala:2206) > at scala.collection.immutable.List.foreach(List.scala:305) > at kafka.log.Log.deleteSegmentFiles(Log.scala:2206) > at kafka.log.Log.removeAndDeleteSegments(Log.scala:2191) > at kafka.log.Log.$anonfun$deleteSegments$2(Log.scala:1700) > at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.scala:17) > at kafka.log.Log.maybeHandleIOException(Log.scala:2316) > at kafka.log.Log.deleteSegments(Log.scala:1691) > at kafka.log.Log.deleteOldSegments(Log.scala:1686) > at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1763) > at kafka.log.Log.deleteOldSegments(Log.scala:1753) > at kafka.log.LogManager.$anonfun$cleanupLogs$3(LogManager.scala:982) > at kafka.log.LogManager.$anonfun$cleanupLogs$3$adapted(LogManager.scala:979) > at scala.collection.immutable.List.foreach(List.scala:305) > at kafka.log.LogManager.cleanupLogs(LogManager.scala:979) > at kafka.log.LogManager.$anonfun$startup$2(LogManager.scala:403) > at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:116) > at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:65) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) 
> at > java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:830) > Suppressed: java.nio.file.FileSystemException: > C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex > -> > C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex.deleted: > The process cannot access the file because it is being used by another > process. > at > java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92) > at > java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103) > at java.base/sun.nio.fs.Wi
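The `Utils.atomicMoveWithFallback` frame in the trace follows a try-atomic-then-fallback pattern, which also explains the `Suppressed:` exception in the log: when the fallback fails too, the atomic failure is attached to it. The standalone re-implementation below is a simplified sketch of that pattern, not Kafka's exact source.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class Moves {

    // Try an atomic rename first; if the filesystem refuses, fall back to a
    // plain replacing move. If both fail (as on Windows while another handle,
    // e.g. a memory-mapped .timeindex, is still open on the file), the atomic
    // failure is attached as a suppressed exception -- matching the
    // "Suppressed:" block in the stack trace above.
    static void atomicMoveWithFallback(Path source, Path target) throws IOException {
        try {
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException atomicFailure) {
            try {
                Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException fallbackFailure) {
                fallbackFailure.addSuppressed(atomicFailure);
                throw fallbackFailure;
            }
        }
    }
}
```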
[jira] [Comment Edited] (KAFKA-9458) Kafka crashed in windows environment
[ https://issues.apache.org/jira/browse/KAFKA-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021999#comment-17021999 ] hirik edited comment on KAFKA-9458 at 3/11/20, 10:13 AM: - [~manme...@gmail.com], I created a git pull request. Can you please review the changes? [https://github.com/apache/kafka/pull/8276] [|https://github.com/apache/kafka/pull/8002] was (Author: hirik): [~manme...@gmail.com], I created a git pull request. Can you please review the changes? [https://github.com/apache/kafka/pull/8002]
[jira] [Commented] (KAFKA-9458) Kafka crashed in windows environment
[ https://issues.apache.org/jira/browse/KAFKA-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056819#comment-17056819 ] ASF GitHub Bot commented on KAFKA-9458: --- hirik commented on pull request #8002: KAFKA-9458 Fix windows clean log fail caused Runtime.halt URL: https://github.com/apache/kafka/pull/8002 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Kafka crashed in windows environment > > > Key: KAFKA-9458 > URL: https://issues.apache.org/jira/browse/KAFKA-9458 > Project: Kafka > Issue Type: Bug > Components: log >Affects Versions: 2.4.0 > Environment: Windows Server 2019 >Reporter: hirik >Priority: Critical > Labels: windows > Fix For: 2.6.0 > > Attachments: logs.zip > > > Hi, > while I was trying to validate Kafka retention policy, Kafka Server crashed > with below exception trace. > [2020-01-21 17:10:40,475] INFO [Log partition=test1-3, > dir=C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka] > Rolled new log segment at offset 1 in 52 ms. (kafka.log.Log) > [2020-01-21 17:10:40,484] ERROR Error while deleting segments for test1-3 in > dir C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka > (kafka.server.LogDirFailureChannel) > java.nio.file.FileSystemException: > C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex > -> > C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex.deleted: > The process cannot access the file because it is being used by another > process. 
> at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92)
> at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
> at java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:395)
> at java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:292)
> at java.base/java.nio.file.Files.move(Files.java:1425)
> at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:795)
> at kafka.log.AbstractIndex.renameTo(AbstractIndex.scala:209)
> at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:497)
> at kafka.log.Log.$anonfun$deleteSegmentFiles$1(Log.scala:2206)
> at kafka.log.Log.$anonfun$deleteSegmentFiles$1$adapted(Log.scala:2206)
> at scala.collection.immutable.List.foreach(List.scala:305)
> at kafka.log.Log.deleteSegmentFiles(Log.scala:2206)
> at kafka.log.Log.removeAndDeleteSegments(Log.scala:2191)
> at kafka.log.Log.$anonfun$deleteSegments$2(Log.scala:1700)
> at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.scala:17)
> at kafka.log.Log.maybeHandleIOException(Log.scala:2316)
> at kafka.log.Log.deleteSegments(Log.scala:1691)
> at kafka.log.Log.deleteOldSegments(Log.scala:1686)
> at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1763)
> at kafka.log.Log.deleteOldSegments(Log.scala:1753)
> at kafka.log.LogManager.$anonfun$cleanupLogs$3(LogManager.scala:982)
> at kafka.log.LogManager.$anonfun$cleanupLogs$3$adapted(LogManager.scala:979)
> at scala.collection.immutable.List.foreach(List.scala:305)
> at kafka.log.LogManager.cleanupLogs(LogManager.scala:979)
> at kafka.log.LogManager.$anonfun$startup$2(LogManager.scala:403)
> at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:116)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:65)
> at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
> at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:830)
> Suppressed: java.nio.file.FileSystemException: C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex -> C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex.deleted: The process cannot access the file because it is being used by another process.
> at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92)
> at java.base/sun.n
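The failing frame in the trace above, Utils.atomicMoveWithFallback, follows a common pattern: attempt an atomic rename first, then fall back to a plain replace-existing move if the filesystem refuses. A minimal self-contained sketch of that pattern (illustrative only, not Kafka's actual implementation; class and file names are invented):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicMoveSketch {
    // Try an atomic rename; if the filesystem rejects it, fall back to a
    // plain replace-existing move, keeping the original failure as suppressed.
    static void atomicMoveWithFallback(Path source, Path target) throws IOException {
        try {
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException outer) {
            try {
                Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException inner) {
                inner.addSuppressed(outer);
                throw inner;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("seg");
        Path src = Files.writeString(dir.resolve("00000000.index"), "data");
        // Same ".deleted" suffix rename that fails in the Windows trace above.
        Path dst = dir.resolve("00000000.index.deleted");
        atomicMoveWithFallback(src, dst);
        System.out.println(Files.exists(dst) && !Files.exists(src)); // true
    }
}
```

On Windows, both the atomic rename and the fallback fail if another handle (such as an open memory map) still pins the source file, which is exactly the FileSystemException reported here.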
[jira] [Commented] (KAFKA-9702) Suspected memory leak
[ https://issues.apache.org/jira/browse/KAFKA-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056795#comment-17056795 ] Nishant Ranjan commented on KAFKA-9702: --- Hi Alexandre, I attached the hprof dump. What I did this time: I simply polled from a topic which has 500,000 records. No processing. The Leak Suspects report is attached as well. As mentioned earlier, I used MAT.
> Suspected memory leak
> ---------------------
>
> Key: KAFKA-9702
> URL: https://issues.apache.org/jira/browse/KAFKA-9702
> Project: Kafka
> Issue Type: Bug
> Components: consumer
> Affects Versions: 2.4.0
> Reporter: Nishant Ranjan
> Priority: Major
> Attachments: java_pid8020.0001.hprof.7z, java_pid8020.0001_Leak_Suspects.zip
>
> I am using a Kafka consumer to fetch objects from a Kafka topic and then persist them in a DB.
> When I ran Eclipse MAT for memory leaks, it reported the following:
> 53 instances of *"java.util.zip.ZipFile$Source"*, loaded by *"<system class loader>"*, occupy *22,33,048 (31.01%)* bytes.
> Also, my observation is that GC is not collecting objects.
> Please let me know if more information is required.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9702) Suspected memory leak
[ https://issues.apache.org/jira/browse/KAFKA-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishant Ranjan updated KAFKA-9702: -- Attachment: java_pid8020.0001_Leak_Suspects.zip -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9702) Suspected memory leak
[ https://issues.apache.org/jira/browse/KAFKA-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishant Ranjan updated KAFKA-9702: -- Attachment: java_pid8020.0001.hprof.7z -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9704) z/OS won't let us resize file when mmap
Shuo Zhang created KAFKA-9704: - Summary: z/OS won't let us resize file when mmap Key: KAFKA-9704 URL: https://issues.apache.org/jira/browse/KAFKA-9704 Project: Kafka Issue Type: Task Components: log Affects Versions: 2.4.0 Reporter: Shuo Zhang Fix For: 2.5.0, 2.4.2 z/OS won't let us resize a file while it is memory-mapped, so we need to force an unmap, as is already done for Windows. -- This message was sent by Atlassian Jira (v8.3.4#803005)
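"Force unmap like Windows" refers to releasing a memory mapping explicitly before resizing the underlying file, since some platforms pin the file while a mapping exists. A minimal stand-alone sketch of the idea (not Kafka's actual AbstractIndex code; assumes JDK 9+, where the unofficial sun.misc.Unsafe.invokeCleaner is available):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ForceUnmapSketch {
    // Explicitly release a MappedByteBuffer's mapping (JDK 9+,
    // via the jdk.unsupported sun.misc.Unsafe.invokeCleaner).
    static void forceUnmap(MappedByteBuffer buffer) throws Exception {
        Field f = Class.forName("sun.misc.Unsafe").getDeclaredField("theUnsafe");
        f.setAccessible(true);
        sun.misc.Unsafe unsafe = (sun.misc.Unsafe) f.get(null);
        unsafe.invokeCleaner(buffer);
    }

    // Map a file, unmap it explicitly, then resize it; returns the new length.
    static long demo() throws Exception {
        File file = File.createTempFile("index", ".tmp");
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(4096);
            MappedByteBuffer mmap =
                    raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            mmap.putInt(0, 42);
            mmap.force();
            // On platforms that pin mappings (Windows, and per this report z/OS),
            // the resize below fails while the buffer is still mapped.
            forceUnmap(mmap); // never touch `mmap` after this point
            raf.setLength(1024);
        }
        long len = file.length();
        file.delete();
        return len;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // 1024
    }
}
```

Accessing the buffer after invokeCleaner is undefined behavior (it can crash the JVM), which is why real implementations null out the reference immediately after unmapping.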
[jira] [Created] (KAFKA-9703) ProducerBatch.split takes up too many resources if the bigBatch is huge
jiamei xie created KAFKA-9703: - Summary: ProducerBatch.split takes up too many resources if the bigBatch is huge Key: KAFKA-9703 URL: https://issues.apache.org/jira/browse/KAFKA-9703 Project: Kafka Issue Type: Bug Reporter: jiamei xie ProducerBatch.split takes up too many resources and might cause an OutOfMemoryError if the bigBatch is huge. How I found this issue is described in https://lists.apache.org/list.html?us...@kafka.apache.org:lte=1M:MESSAGE_TOO_LARGE
The following is the code which takes a lot of resources:
{code:java}
for (Record record : recordBatch) {
    assert thunkIter.hasNext();
    Thunk thunk = thunkIter.next();
    if (batch == null)
        batch = createBatchOffAccumulatorForRecord(record, splitBatchSize);

    // A newly created batch can always host the first message.
    if (!batch.tryAppendForSplit(record.timestamp(), record.key(), record.value(), record.headers(), thunk)) {
        batches.add(batch);
        batch = createBatchOffAccumulatorForRecord(record, splitBatchSize);
        batch.tryAppendForSplit(record.timestamp(), record.key(), record.value(), record.headers(), thunk);
    }
}
{code}
Referring to RecordAccumulator#tryAppend, we can call closeForRecordAppends() after a batch is full:
{code:java}
private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers,
                                     Callback callback, Deque<ProducerBatch> deque, long nowMs) {
    ProducerBatch last = deque.peekLast();
    if (last != null) {
        FutureRecordMetadata future = last.tryAppend(timestamp, key, value, headers, callback, nowMs);
        if (future == null)
            last.closeForRecordAppends();
        else
            return new RecordAppendResult(future, deque.size() > 1 || last.isFull(), false, false);
    }
    return null;
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
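The suggestion above, closing each split batch for further appends as soon as it is full so that append-side resources can be released early, can be sketched with a toy batch type (all class and method names here are illustrative stand-ins, not Kafka's actual ProducerBatch API):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Toy stand-in for ProducerBatch: fixed capacity, can be closed for appends.
    static final class Batch {
        private final int capacity;
        private final List<String> records = new ArrayList<>();
        private boolean closedForAppends = false;

        Batch(int capacity) { this.capacity = capacity; }

        boolean tryAppend(String record) {
            if (closedForAppends || records.size() >= capacity) return false;
            records.add(record);
            return true;
        }

        // Release append-side resources early, as the JIRA suggests.
        void closeForRecordAppends() { closedForAppends = true; }

        int size() { return records.size(); }
    }

    // Split a big record list into batches of at most splitBatchSize,
    // closing each full batch before allocating the next one.
    static List<Batch> split(List<String> bigBatch, int splitBatchSize) {
        List<Batch> batches = new ArrayList<>();
        Batch batch = null;
        for (String record : bigBatch) {
            if (batch == null) batch = new Batch(splitBatchSize);
            if (!batch.tryAppend(record)) {
                batch.closeForRecordAppends(); // full: stop holding append buffers
                batches.add(batch);
                batch = new Batch(splitBatchSize);
                batch.tryAppend(record); // a fresh batch always takes one record
            }
        }
        if (batch != null) { batch.closeForRecordAppends(); batches.add(batch); }
        return batches;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) records.add("record-" + i);
        System.out.println(split(records, 3).size()); // 4
    }
}
```

The point of the early close is that a batch which will never be appended to again does not need to keep its append-path state alive while the rest of the big batch is being split.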
[jira] [Closed] (KAFKA-9343) Add ps command for Kafka and zookeeper process on z/OS.
[ https://issues.apache.org/jira/browse/KAFKA-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Zhang closed KAFKA-9343. -
> Add ps command for Kafka and zookeeper process on z/OS.
> -------------------------------------------------------
>
> Key: KAFKA-9343
> URL: https://issues.apache.org/jira/browse/KAFKA-9343
> Project: Kafka
> Issue Type: Task
> Components: tools
> Affects Versions: 2.4.0
> Environment: z/OS, OS/390
> Reporter: Shuo Zhang
> Priority: Major
> Labels: OS/390, z/OS
> Fix For: 2.5.0, 2.4.2
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> +Note: since the final change scope changed, I changed the summary and description.+
> The existing method of checking the Kafka process on other platforms isn't applicable to z/OS; on z/OS, the best keyword we can use is the JOBNAME.
> PIDS=$(ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}')
> -->
> PIDS=$(ps -A -o pid,jobname,comm | grep -i $JOBNAME | grep java | grep -v grep | awk '{print $1}')
> The same applies to the zookeeper process.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9343) Add ps command for Kafka and zookeeper process on z/OS.
[ https://issues.apache.org/jira/browse/KAFKA-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Zhang updated KAFKA-9343: -- Description: +Note: since the final change scope changed, I changed the summary and description.+ The existing method of checking the Kafka process on other platforms isn't applicable to z/OS; on z/OS, the best keyword we can use is the JOBNAME.
PIDS=$(ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}')
-->
PIDS=$(ps -A -o pid,jobname,comm | grep -i $JOBNAME | grep java | grep -v grep | awk '{print $1}')
The same applies to the zookeeper process. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9343) Add ps command for Kafka and zookeeper process on z/OS.
[ https://issues.apache.org/jira/browse/KAFKA-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Zhang updated KAFKA-9343: -- Description: (was: +Note: since the final change scope changed, I changed the summary and description.+ The existing method to check Kafka process for other platform doesn't applicable for z/OS, on z/OS, the best keyword we can use is the JOBNAME. PIDS=$(ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}') --> PIDS=$(ps -A -o pid,jobname,comm | grep -i $JOBNAME | grep java | grep -v grep | awk '{print $1}') So does the zookeeper process.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9343) Add ps command for Kafka and zookeeper process on z/OS.
[ https://issues.apache.org/jira/browse/KAFKA-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Zhang updated KAFKA-9343: -- Fix Version/s: 2.4.2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9343) Add ps command for Kafka and zookeeper process on z/OS.
[ https://issues.apache.org/jira/browse/KAFKA-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Zhang updated KAFKA-9343: -- Fix Version/s: (was: 2.6.0) 2.5.0 Description: +Note: since the final change scope changed, I changed the summary and description.+ The existing method to check Kafka process for other platform doesn't applicable for z/OS, on z/OS, the best keyword we can use is the JOBNAME. PIDS=$(ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}') --> PIDS=$(ps -A -o pid,jobname,comm | grep -i $JOBNAME | grep java | grep -v grep | awk '{print $1}') So does the zookeeper process. was: To make Kafka runnable on z/OS, I need some changes to the shell script and java/scala files, to Summary: Add ps command for Kafka and zookeeper process on z/OS. (was: Make Kafka shell runnable on z/OS) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9698) Wrong default max.message.bytes in document
[ https://issues.apache.org/jira/browse/KAFKA-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056751#comment-17056751 ] Tom Bentley commented on KAFKA-9698: --- KAFKA-4203 has been fixed for Kafka 2.5, which is not released yet, and is therefore not reflected in the documentation currently at http://kafka.apache.org/documentation/. When 2.5 is released, the documentation will be updated.
> Wrong default max.message.bytes in document
> -------------------------------------------
>
> Key: KAFKA-9698
> URL: https://issues.apache.org/jira/browse/KAFKA-9698
> Project: Kafka
> Issue Type: Bug
> Components: documentation
> Affects Versions: 2.4.0
> Reporter: jiamei xie
> Priority: Major
>
> The broker default for max.message.bytes has been changed to 1048588 in https://issues.apache.org/jira/browse/KAFKA-4203, but the default value in http://kafka.apache.org/documentation/ is still 112.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9343) Make Kafka shell runnable on z/OS
[ https://issues.apache.org/jira/browse/KAFKA-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Zhang updated KAFKA-9343: -- Description: To make Kafka runnable on z/OS, I need some changes to the shell script and java/scala files, to (was: To make Kafka runnable on z/OS, I need to change the following 4 shell scripts: kafka-run-class.sh, kafka-server-start/sh, kafka-server-stop.sh, zookeeper-server-stop.sh ) > Make Kafka shell runnable on z/OS > - > > Key: KAFKA-9343 > URL: https://issues.apache.org/jira/browse/KAFKA-9343 > Project: Kafka > Issue Type: Task > Components: tools >Affects Versions: 2.4.0 > Environment: z/OS, OS/390 >Reporter: Shuo Zhang >Priority: Major > Labels: OS/390, z/OS > Fix For: 2.6.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > To make Kafka runnable on z/OS, I need some changes to the shell script and > java/scala files, to -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9702) Suspected memory leak
[ https://issues.apache.org/jira/browse/KAFKA-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056739#comment-17056739 ] Alexandre Dupriez commented on KAFKA-9702: -- Could you please provide a link to a stored heap dump? Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9702) Suspected memory leak
[ https://issues.apache.org/jira/browse/KAFKA-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056738#comment-17056738 ] Nishant Ranjan commented on KAFKA-9702: --- When I run the proper code on a Linux server with approx. 500,000/600,000 records, memory shoots to 6 GB and then I get an OOM (memory constraints). The above was a sample from a laptop on Windows (a kind of "Hello World" of the problem). To identify the memory issue, I wrote just 3 lines:
while (true) {
    consumerRecords = consumer.poll(1000);
    consumerRecords = null;
}
Even with the above, MAT is showing a probable leak. Maybe I am missing some Kafka option, but there is some memory issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9702) Suspected memory leak
[ https://issues.apache.org/jira/browse/KAFKA-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056725#comment-17056725 ] Alexandre Dupriez commented on KAFKA-9702: -- 22,33,048 bytes - do you mean 22 MB? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9702) Suspected memory leak
Nishant Ranjan created KAFKA-9702: - Summary: Suspected memory leak Key: KAFKA-9702 URL: https://issues.apache.org/jira/browse/KAFKA-9702 Project: Kafka Issue Type: Bug Components: consumer Affects Versions: 2.4.0 Reporter: Nishant Ranjan I am using a Kafka consumer to fetch objects from a Kafka topic and then persist them in a DB. When I ran Eclipse MAT for memory leaks, it reported the following: 53 instances of *"java.util.zip.ZipFile$Source"*, loaded by *"<system class loader>"*, occupy *22,33,048 (31.01%)* bytes. Also, my observation is that GC is not collecting objects. Please let me know if more information is required. -- This message was sent by Atlassian Jira (v8.3.4#803005)