[jira] [Created] (KAFKA-9323) Refactor Streams' upgrade system tests

2019-12-20 Thread Sophie Blee-Goldman (Jira)
Sophie Blee-Goldman created KAFKA-9323:
--

 Summary: Refactor Streams' upgrade system tests
 Key: KAFKA-9323
 URL: https://issues.apache.org/jira/browse/KAFKA-9323
 Project: Kafka
  Issue Type: Improvement
  Components: streams
Reporter: Sophie Blee-Goldman


With the introduction of version probing in 2.0 and cooperative rebalancing in 
2.4, the specific upgrade path depends heavily on the to and from versions. This 
can be a complex operation, and we should make sure to test a realistic upgrade 
scenario across all possible combinations. However, the current system tests 
have gaps that have allowed bugs in the upgrade path to slip by unnoticed for 
several versions. 

Our current system tests include a metadata upgrade test, a version probing 
test, and a cooperative upgrade test. This has a few drawbacks:

a) only the version probing test covers "forwards compatibility" (upgrade from 
the latest to a future version)

b) nothing tests version probing "backwards compatibility" (upgrade from older 
version to latest), except:

c) the cooperative rebalancing test actually happens to involve a version 
probing step, and so coincidentally DOES test VP (but only starting with 2.4)

d) each test itself tries to test the upgrade across different versions, 
meaning there may be overlap and/or unnecessary tests 

e) as new versions are released, it is unclear to those not directly involved 
in these tests and/or projects whether and what needs to be updated (e.g., should 
this new version be added to the cooperative test? should the old version be 
added to the metadata test?)

We should definitely fill in the testing gap here, but how to do so is of 
course up for discussion.

I would propose refactoring the upgrade tests: rather than maintaining 
different lists of versions to pass as input to each test, we should have a 
single matrix, updated with each new release, that specifies which upgrade path 
each version combination actually requires. We can then loop through every 
version combination and test only the upgrade path that users would actually 
need to follow. This way we can be sure we are not missing anything, as each 
and every possible upgrade would be tested. A rough sketch of the idea follows 
below.
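
Purely as an illustration of the proposed structure (the path names, version 
strings, and matrix entries below are hypothetical placeholders, and the real 
system tests are written in Python/ducktape rather than Java), the 
single-matrix idea might look roughly like this:

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a single upgrade-path matrix driving the Streams upgrade tests.
public class UpgradePathMatrix {

    enum UpgradePath { SINGLE_ROLLING_BOUNCE, TWO_BOUNCES_WITH_UPGRADE_FROM, COOPERATIVE_TWO_BOUNCES }

    // (fromVersion -> toVersion) mapped to the upgrade path users would actually follow;
    // updated once per release instead of maintaining separate version lists per test.
    private static final Map<String, UpgradePath> MATRIX = new HashMap<>();
    static {
        MATRIX.put("1.1->2.4", UpgradePath.TWO_BOUNCES_WITH_UPGRADE_FROM);  // placeholder entry
        MATRIX.put("2.0->2.4", UpgradePath.COOPERATIVE_TWO_BOUNCES);        // placeholder entry
        MATRIX.put("2.3->2.4", UpgradePath.COOPERATIVE_TWO_BOUNCES);        // placeholder entry
        // ... one entry per supported (from, to) combination
    }

    public static void main(String[] args) {
        List<String> versions = List.of("1.1", "2.0", "2.3", "2.4");
        // Loop over every combination and run only the documented upgrade path for it.
        for (String from : versions) {
            for (String to : versions) {
                UpgradePath path = MATRIX.get(from + "->" + to);
                if (path != null) {
                    System.out.printf("run system test: upgrade %s -> %s via %s%n", from, to, path);
                }
            }
        }
    }
}
{code}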



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9173) StreamsPartitionAssignor assigns partitions to only one worker

2019-12-20 Thread Guozhang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001277#comment-17001277
 ] 

Guozhang Wang commented on KAFKA-9173:
--

Sounds good to me.

> StreamsPartitionAssignor assigns partitions to only one worker
> --
>
> Key: KAFKA-9173
> URL: https://issues.apache.org/jira/browse/KAFKA-9173
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Oleg Muravskiy
>Priority: Major
>  Labels: user-experience
> Attachments: StreamsPartitionAssignor.log
>
>
> I'm running a distributed KafkaStreams application on 10 worker nodes, 
> subscribed to 21 topics with 10 partitions each. I'm only using the 
> Processor interface and a persistent state store.
> However, only one worker gets assigned partitions; all other workers get 
> nothing. Restarting the application or cleaning local state stores does not 
> help. StreamsPartitionAssignor migrates to other nodes and eventually picks 
> another node to assign partitions to, but still only one node.
> It's difficult to figure out where to look for signs of the problem, so I'm 
> attaching the log messages from the StreamsPartitionAssignor. Let me know 
> what else I could provide to help resolve this.
> [^StreamsPartitionAssignor.log]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9307) Transaction coordinator could be left in unknown state after ZK session timeout

2019-12-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001266#comment-17001266
 ] 

ASF GitHub Bot commented on KAFKA-9307:
---

dhruvilshah3 commented on pull request #7840: KAFKA-9307: Make transaction 
metadata loading resilient to previous errors
URL: https://github.com/apache/kafka/pull/7840
 
 
   Allow transaction metadata to be reloaded, even if it already exists as of a 
previous epoch. This helps with cases where a previous become-follower 
transition failed to unload corresponding metadata.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Transaction coordinator could be left in unknown state after ZK session 
> timeout
> ---
>
> Key: KAFKA-9307
> URL: https://issues.apache.org/jira/browse/KAFKA-9307
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Reporter: Dhruvil Shah
>Assignee: Dhruvil Shah
>Priority: Major
>
> We observed a case where the transaction coordinator could not load 
> transaction state from the __transaction_state topic partition. Clients would 
> continue to see COORDINATOR_LOAD_IN_PROGRESS exceptions until the broker 
> hosting the coordinator is restarted.
> This is the sequence of events that leads to the issue:
>  # The broker is the leader of one (or more) transaction state topic 
> partitions.
>  # The broker loses its ZK session due to a network issue.
>  # Broker reestablishes session with ZK, though there are still transient 
> network issues.
>  # Broker is made follower of the transaction state topic partition it was 
> leading earlier.
>  # During the become-follower transition, the broker loses its ZK session 
> again.
>  # The become-follower transition for this broker fails partway through, leaving 
> us in a partial leader / partial follower state for the transaction topic. 
> This meant that we could not unload the transaction metadata. However, the 
> broker successfully caches the leader epoch associated with the 
> LeaderAndIsrRequest.
>  # Later, when the ZK session is finally established successfully, the broker 
> ignores the become-follower transition as the leader epoch was the same as the 
> one it had cached. This prevented the transaction metadata from being 
> unloaded.
>  # Because this partition was a partial follower, we had set up replica 
> fetchers. The partition continued to fetch from the leader until it was made 
> part of the ISR.
>  # Once it was part of the ISR, preferred leader election kicked in and 
> elected this broker as the leader.
>  # When processing the become-leader transition, the transaction state load 
> operation failed as we already had transaction metadata loaded at a previous 
> epoch.
>  # This meant that this partition was left in the "loading" state and we thus 
> returned COORDINATOR_LOAD_IN_PROGRESS errors.
> Restarting the broker that hosts the transaction state coordinator is the 
> only way to recover from this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9307) Transaction coordinator could be left in unknown state after ZK session timeout

2019-12-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001265#comment-17001265
 ] 

ASF GitHub Bot commented on KAFKA-9307:
---

dhruvilshah3 commented on pull request #7840: KAFKA-9307: Make transaction 
metadata loading resilient to previous errors
URL: https://github.com/apache/kafka/pull/7840
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Transaction coordinator could be left in unknown state after ZK session 
> timeout
> ---
>
> Key: KAFKA-9307
> URL: https://issues.apache.org/jira/browse/KAFKA-9307
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Reporter: Dhruvil Shah
>Assignee: Dhruvil Shah
>Priority: Major
>
> We observed a case where the transaction coordinator could not load 
> transaction state from the __transaction_state topic partition. Clients would 
> continue to see COORDINATOR_LOAD_IN_PROGRESS exceptions until the broker 
> hosting the coordinator is restarted.
> This is the sequence of events that leads to the issue:
>  # The broker is the leader of one (or more) transaction state topic 
> partitions.
>  # The broker loses its ZK session due to a network issue.
>  # Broker reestablishes session with ZK, though there are still transient 
> network issues.
>  # Broker is made follower of the transaction state topic partition it was 
> leading earlier.
>  # During the become-follower transition, the broker loses its ZK session 
> again.
>  # The become-follower transition for this broker fails partway through, leaving 
> us in a partial leader / partial follower state for the transaction topic. 
> This meant that we could not unload the transaction metadata. However, the 
> broker successfully caches the leader epoch associated with the 
> LeaderAndIsrRequest.
>  # Later, when the ZK session is finally established successfully, the broker 
> ignores the become-follower transition as the leader epoch was the same as the 
> one it had cached. This prevented the transaction metadata from being 
> unloaded.
>  # Because this partition was a partial follower, we had set up replica 
> fetchers. The partition continued to fetch from the leader until it was made 
> part of the ISR.
>  # Once it was part of the ISR, preferred leader election kicked in and 
> elected this broker as the leader.
>  # When processing the become-leader transition, the transaction state load 
> operation failed as we already had transaction metadata loaded at a previous 
> epoch.
>  # This meant that this partition was left in the "loading" state and we thus 
> returned COORDINATOR_LOAD_IN_PROGRESS errors.
> Restarting the broker that hosts the transaction state coordinator is the 
> only way to recover from this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9062) Handle stalled writes to RocksDB

2019-12-20 Thread Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001259#comment-17001259
 ] 

Sophie Blee-Goldman commented on KAFKA-9062:


Just to clarify, the "simple workaround" wouldn't involve turning off batching, 
just autocompaction. I get the sense the suggested bulk loading configuration 
is aimed at the specific case where you would like to dump a large amount of 
data to rocks and then start querying it. This differs slightly from what 
Streams actually needs to do, which is restore all that data and then start 
_writing_ (and also reading) from it – it occurs to me that the bulk loading 
mode is not targeted at our specific use case, since queries would not be 
stalled by excessive L0 files/compaction, only writes.

That said, as mentioned above, I think we should solve this holistically – 
stalled writes can still happen for other reasons. I just want to point out 
that the bulk loading mode may not be what we think it is.
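
For context, RocksDB tuning in Streams is done through a custom 
RocksDBConfigSetter registered via the rocksdb.config.setter config. Here is a 
minimal sketch of where compaction-related settings like the ones discussed in 
this ticket would be adjusted; the values are illustrative only, not a 
recommendation:

{code:java}
import java.util.Map;

import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

// Minimal sketch of a custom RocksDBConfigSetter; values below are illustrative only.
// Register it via StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG ("rocksdb.config.setter").
public class CompactionTuningConfigSetter implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        // Compaction-related knobs discussed in this ticket; whether (and how) to change them
        // depends on which restore/bulk-loading behavior is being worked around.
        options.setDisableAutoCompactions(false);      // keep background compactions running
        options.setLevel0SlowdownWritesTrigger(20);    // L0 file count at which writes are throttled
        options.setLevel0StopWritesTrigger(36);        // L0 file count at which writes stall entirely
    }

    @Override
    public void close(final String storeName, final Options options) {
        // Nothing allocated in setConfig, so nothing to close here.
    }
}
{code}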

> Handle stalled writes to RocksDB
> 
>
> Key: KAFKA-9062
> URL: https://issues.apache.org/jira/browse/KAFKA-9062
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Priority: Major
>
> RocksDB may stall writes at times when background compactions or flushes are 
> having trouble keeping up. This means we can effectively end up blocking 
> indefinitely during a StateStore#put call within Streams, and may get kicked 
> from the group if the throttling does not ease up within the max poll 
> interval.
> Example: when restoring large amounts of state from scratch, we use the 
> strategy recommended by RocksDB of turning off automatic compactions and 
> dumping everything into L0. We do batch somewhat, but do not sort these small 
> batches before loading into the db, so we end up with a large number of 
> unsorted L0 files.
> When restoration is complete and we toggle the db back to normal (not bulk 
> loading) settings, a background compaction is triggered to merge all these 
> into the next level. This background compaction can take a long time to merge 
> unsorted keys, especially when the amount of data is quite large.
> Any new writes while the number of L0 files exceeds the max will be stalled 
> until the compaction can finish, and processing after restoring from scratch 
> can block beyond the polling interval.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9173) StreamsPartitionAssignor assigns partitions to only one worker

2019-12-20 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001246#comment-17001246
 ] 

Matthias J. Sax commented on KAFKA-9173:


So should we close this as "not a problem" as the system behaves as designed?

> StreamsPartitionAssignor assigns partitions to only one worker
> --
>
> Key: KAFKA-9173
> URL: https://issues.apache.org/jira/browse/KAFKA-9173
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.3.0, 2.2.1
>Reporter: Oleg Muravskiy
>Priority: Major
>  Labels: user-experience
> Attachments: StreamsPartitionAssignor.log
>
>
> I'm running a distributed KafkaStreams application on 10 worker nodes, 
> subscribed to 21 topics with 10 partitions each. I'm only using the 
> Processor interface and a persistent state store.
> However, only one worker gets assigned partitions; all other workers get 
> nothing. Restarting the application or cleaning local state stores does not 
> help. StreamsPartitionAssignor migrates to other nodes and eventually picks 
> another node to assign partitions to, but still only one node.
> It's difficult to figure out where to look for signs of the problem, so I'm 
> attaching the log messages from the StreamsPartitionAssignor. Let me know 
> what else I could provide to help resolve this.
> [^StreamsPartitionAssignor.log]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9305) Add version 2.4 to streams system tests

2019-12-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001242#comment-17001242
 ] 

ASF GitHub Bot commented on KAFKA-9305:
---

mjsax commented on pull request #7841: KAFKA-9305: Add version 2.4 to Streams 
system tests
URL: https://github.com/apache/kafka/pull/7841
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


>  Add version 2.4 to streams system tests
> 
>
> Key: KAFKA-9305
> URL: https://issues.apache.org/jira/browse/KAFKA-9305
> Project: Kafka
>  Issue Type: Bug
>Reporter: Manikumar
>Assignee: Bruno Cadonna
>Priority: Major
> Fix For: 2.5.0
>
>
> cc [~mjsax]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-4134) Transparently notify users of "Connection Refused" for client to broker connections

2019-12-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001172#comment-17001172
 ] 

ASF GitHub Bot commented on KAFKA-4134:
---

cotedm commented on pull request #1829: KAFKA-4134: log ConnectException at WARN
URL: https://github.com/apache/kafka/pull/1829
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Transparently notify users of "Connection Refused" for client to broker 
> connections
> ---
>
> Key: KAFKA-4134
> URL: https://issues.apache.org/jira/browse/KAFKA-4134
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer, producer 
>Affects Versions: 0.10.0.1
>Reporter: Dustin Cote
>Assignee: Dustin Cote
>Priority: Minor
>
> Currently, producers and consumers log at the WARN level if the bootstrap 
> server disconnects and if there is an unexpected exception in the network 
> Selector. However, we log at DEBUG level if an IOException occurs, in order 
> to prevent spamming the user with every network hiccup. This has the side 
> effect that users making initial connections to brokers get no feedback if 
> the bootstrap server list is invalid. For example, if one starts the console 
> producer or consumer without any brokers running, nothing indicates that 
> messages are not being received until the socket timeout is hit.
> I propose we be more granular and log the ConnectException at WARN to let the 
> user know their broker(s) are not reachable.
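
A conceptual sketch of the proposed granularity (the class, method, and logger 
names here are purely illustrative; this is not the actual network Selector 
code):

{code:java}
import java.io.IOException;
import java.net.ConnectException;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Conceptual sketch only -- not the actual o.a.k.common.network.Selector implementation.
public class ConnectionAttemptLogging {
    private static final Logger log = LoggerFactory.getLogger(ConnectionAttemptLogging.class);

    void handleConnectionError(final String brokerAddress, final IOException e) {
        if (e instanceof ConnectException) {
            // Surface "connection refused" so users notice a bad bootstrap list immediately.
            log.warn("Connection to broker {} was refused", brokerAddress, e);
        } else {
            // Keep routine network hiccups at DEBUG to avoid log spam.
            log.debug("I/O error communicating with broker {}", brokerAddress, e);
        }
    }
}
{code}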



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-6817) UnknownProducerIdException when writing messages with old timestamps

2019-12-20 Thread Roman Schmitz (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000868#comment-17000868
 ] 

Roman Schmitz commented on KAFKA-6817:
--

Same problem here. We are reading 3 months old data for reloading or testing 
and this error occurs in the first seconds after starting the KStreams job. 
Retention is set to infinite (-1) for both time and bytes.

> UnknownProducerIdException when writing messages with old timestamps
> 
>
> Key: KAFKA-6817
> URL: https://issues.apache.org/jira/browse/KAFKA-6817
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 1.1.0
>Reporter: Odin Standal
>Priority: Major
>
> We are seeing the following exception in our Kafka application: 
> {code:java}
> ERROR o.a.k.s.p.internals.StreamTask - task [0_0] Failed to close producer 
> due to the following error: org.apache.kafka.streams.errors.StreamsException: 
> task [0_0] Abort sending since an error caught with a previous record (key 
> 22 value some-value timestamp 1519200902670) to topic 
> exactly-once-test-topic- v2 due to This exception is raised by the broker if 
> it could not locate the producer metadata associated with the producerId in 
> question. This could happen if, for instance, the producer's records were 
> deleted because their retention time had elapsed. Once the last records of 
> the producerId are removed, the producer's metadata is removed from the 
> broker, and future appends by the producer will return this exception. at 
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:125)
>  at 
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:48)
>  at 
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:180)
>  at 
> org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1199)
>  at 
> org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
>  at 
> org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
>  at 
> org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627) 
> at 
> org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596) 
> at 
> org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
>  at 
> org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
>  at 
> org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74) 
> at 
> org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
>  at 
> org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101) 
> at 
> org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
>  at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474) at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239) at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163) at 
> java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.kafka.common.errors.UnknownProducerIdException
> {code}
> We discovered this error when we had the need to reprocess old messages. See 
> more details on 
> [Stackoverflow|https://stackoverflow.com/questions/49872827/unknownproduceridexception-in-kafka-streams-when-enabling-exactly-once?noredirect=1#comment86901917_49872827]
> We have reproduced the error with a smaller example application. The error 
> occurs after 10 minutes of producing messages that have old timestamps 
> (typically 1 year old). The topic we are writing to has retention.ms set to 1 
> year, so we are expecting the messages to stay there.
> After digging through the ProducerStateManager code in the Kafka source code, 
> we have a theory of what might be wrong.
> The ProducerStateManager.removeExpiredProducers() seems to remove producers 
> from memory erroneously when processing records which are older than the 
> maxProducerIdExpirationMs (coming from the `transactional.id.expiration.ms` 
> configuration), which is set by default to 7 days. 
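
To make the suspected interaction concrete, here is an illustrative sketch of 
the arithmetic only (not the actual ProducerStateManager code); it assumes, per 
the theory above, that the producer entry's last timestamp is taken from the 
appended record's timestamp:

{code:java}
import java.util.concurrent.TimeUnit;

// Illustrative arithmetic only -- not the actual ProducerStateManager code.
public class ProducerExpiryExample {
    public static void main(String[] args) {
        long maxProducerIdExpirationMs = TimeUnit.DAYS.toMillis(7); // transactional.id.expiration.ms default
        long now = System.currentTimeMillis();

        // Per the theory in this ticket, the producer entry's "last timestamp" follows the
        // appended record's timestamp, so reprocessing a ~1-year-old record makes the entry
        // look long expired even though the producer is still active.
        long recordTimestamp = now - TimeUnit.DAYS.toMillis(365);

        boolean expired = now - recordTimestamp > maxProducerIdExpirationMs;
        System.out.println("Producer state considered expired: " + expired); // prints true
    }
}
{code}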



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-6985) Error connection between cluster node

2019-12-20 Thread Radoslaw Gasiorek (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000779#comment-17000779
 ] 

Radoslaw Gasiorek edited comment on KAFKA-6985 at 12/20/19 9:56 AM:


Bumping this: we had a MIM because of this (on Kafka 2.2).

Likely root cause: a temporary latency/network issue caused some broker nodes 
to disconnect from the cluster and never fully rejoin, while at the same time 
they remained leaders for some partitions. The initial problems show 
similarities to https://issues.apache.org/jira/browse/KAFKA-7165 and 
https://issues.apache.org/jira/browse/KAFKA-6584


was (Author: rgasiorek):
Bumping this: we had a MIM because of this (on Kafka 2.2).

Likely root cause: a temporary latency/network issue caused some broker nodes 
to disconnect from the cluster and never fully rejoin, while at the same time 
they remained leaders for some partitions. Possibly related to 
https://issues.apache.org/jira/browse/KAFKA-7165 and 
https://issues.apache.org/jira/browse/KAFKA-6584

> Error connection between cluster node
> -
>
> Key: KAFKA-6985
> URL: https://issues.apache.org/jira/browse/KAFKA-6985
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
> Environment: Centos-7
>Reporter: Ranjeet Ranjan
>Priority: Major
>
> Hi, I have set up a multi-node Kafka cluster but am getting an error while 
> connecting one node to another, although there does not appear to be an issue 
> with the firewall or port; I am able to telnet. 
> WARN [ReplicaFetcherThread-0-1], Error in fetch 
> Kafka.server.ReplicaFetcherThread$FetchRequest@8395951 
> (Kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to Kafka-1:9092 (id: 1 rack: null) failed
>  
> {code:java}
>  
> at 
> kafka.utils.NetworkClientBlockingOps$.awaitReady$1(NetworkClientBlockingOps.scala:84)
> at 
> kafka.utils.NetworkClientBlockingOps$.blockingReady$extension(NetworkClientBlockingOps.scala:94)
> at 
> kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:244)
> at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
> at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
> at 
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> {code}
> Here you go server.properties
> Node:1
>  
> {code:java}
> # Server Basics #
> # The id of the broker. This must be set to a unique integer for each broker.
> broker.id=1
> # Switch to enable topic deletion or not, default value is false
> delete.topic.enable=true
> # Socket Server Settings 
> #
> listeners=PLAINTEXT://kafka-1:9092
> advertised.listeners=PLAINTEXT://kafka-1:9092
> #listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
> # The number of threads handling network requests
> num.network.threads=3
> # The number of threads doing disk I/O
> num.io.threads=8
> # The send buffer (SO_SNDBUF) used by the socket server
> socket.send.buffer.bytes=102400
> # The receive buffer (SO_RCVBUF) used by the socket server
> socket.receive.buffer.bytes=102400
> # The maximum size of a request that the socket server will accept 
> (protection against OOM)
> socket.request.max.bytes=104857600
> # Log Basics #
> # A comma seperated list of directories under which to store log files
> log.dirs=/var/log/kafka
> # The default number of log partitions per topic. More partitions allow 
> greater
> # parallelism for consumption, but this will also result in more files across
> # the brokers.
> num.partitions=1
> # The number of threads per data directory to be used for log recovery at 
> startup and flushing at shutdown.
> # This value is recommended to be increased for installations with data dirs 
> located in RAID array.
> num.recovery.threads.per.data.dir=1
> # Log Retention Policy 
> #
> # The minimum age of a log file to be eligible for deletion due to age
> log.retention.hours=48
> # A size-based retention policy for logs. Segments are pruned from the log as 
> long as the remaining
> # segments don't drop below log.retention.bytes. Functions independently of 
> log.retention.hours.
> log.retention.bytes=1073741824
> # The maximum size of a log segment file. When this size is reached a new log 
> segment will be created.
> log.segment.bytes=1073741824
> # The interval at which log segments are checked to see if they can be 
> deleted according
> # to the retention policies
> 

[jira] [Comment Edited] (KAFKA-6985) Error connection between cluster node

2019-12-20 Thread Radoslaw Gasiorek (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000779#comment-17000779
 ] 

Radoslaw Gasiorek edited comment on KAFKA-6985 at 12/20/19 9:55 AM:


Bumping this: we had a MIM because of this (on Kafka 2.2).

Likely root cause: a temporary latency/network issue caused some broker nodes 
to disconnect from the cluster and never fully rejoin, while at the same time 
they remained leaders for some partitions. Possibly related to 
https://issues.apache.org/jira/browse/KAFKA-7165 and 
https://issues.apache.org/jira/browse/KAFKA-6584


was (Author: rgasiorek):
Bumping this: we had a MIM because of this (Kafka 2.2).

Likely root cause: a temporary latency/network issue caused some broker nodes 
to disconnect from the cluster and never fully rejoin, while at the same time 
they remained leaders for some partitions. Possibly related to 
https://issues.apache.org/jira/browse/KAFKA-7165 and 
https://issues.apache.org/jira/browse/KAFKA-6584

> Error connection between cluster node
> -
>
> Key: KAFKA-6985
> URL: https://issues.apache.org/jira/browse/KAFKA-6985
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
> Environment: Centos-7
>Reporter: Ranjeet Ranjan
>Priority: Major
>
> Hi, I have set up a multi-node Kafka cluster but am getting an error while 
> connecting one node to another, although there does not appear to be an issue 
> with the firewall or port; I am able to telnet. 
> WARN [ReplicaFetcherThread-0-1], Error in fetch 
> Kafka.server.ReplicaFetcherThread$FetchRequest@8395951 
> (Kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to Kafka-1:9092 (id: 1 rack: null) failed
>  
> {code:java}
>  
> at 
> kafka.utils.NetworkClientBlockingOps$.awaitReady$1(NetworkClientBlockingOps.scala:84)
> at 
> kafka.utils.NetworkClientBlockingOps$.blockingReady$extension(NetworkClientBlockingOps.scala:94)
> at 
> kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:244)
> at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
> at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
> at 
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> {code}
> Here you go server.properties
> Node:1
>  
> {code:java}
> # Server Basics #
> # The id of the broker. This must be set to a unique integer for each broker.
> broker.id=1
> # Switch to enable topic deletion or not, default value is false
> delete.topic.enable=true
> # Socket Server Settings 
> #
> listeners=PLAINTEXT://kafka-1:9092
> advertised.listeners=PLAINTEXT://kafka-1:9092
> #listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
> # The number of threads handling network requests
> num.network.threads=3
> # The number of threads doing disk I/O
> num.io.threads=8
> # The send buffer (SO_SNDBUF) used by the socket server
> socket.send.buffer.bytes=102400
> # The receive buffer (SO_RCVBUF) used by the socket server
> socket.receive.buffer.bytes=102400
> # The maximum size of a request that the socket server will accept 
> (protection against OOM)
> socket.request.max.bytes=104857600
> # Log Basics #
> # A comma seperated list of directories under which to store log files
> log.dirs=/var/log/kafka
> # The default number of log partitions per topic. More partitions allow 
> greater
> # parallelism for consumption, but this will also result in more files across
> # the brokers.
> num.partitions=1
> # The number of threads per data directory to be used for log recovery at 
> startup and flushing at shutdown.
> # This value is recommended to be increased for installations with data dirs 
> located in RAID array.
> num.recovery.threads.per.data.dir=1
> # Log Retention Policy 
> #
> # The minimum age of a log file to be eligible for deletion due to age
> log.retention.hours=48
> # A size-based retention policy for logs. Segments are pruned from the log as 
> long as the remaining
> # segments don't drop below log.retention.bytes. Functions independently of 
> log.retention.hours.
> log.retention.bytes=1073741824
> # The maximum size of a log segment file. When this size is reached a new log 
> segment will be created.
> log.segment.bytes=1073741824
> # The interval at which log segments are checked to see if they can be 
> deleted according
> # to the retention policies
> log.retention.check.interval.ms=30
> 

[jira] [Comment Edited] (KAFKA-6985) Error connection between cluster node

2019-12-20 Thread Radoslaw Gasiorek (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000779#comment-17000779
 ] 

Radoslaw Gasiorek edited comment on KAFKA-6985 at 12/20/19 9:55 AM:


Bumping this: we had a MIM because of this (Kafka 2.2).

Likely root cause: a temporary latency/network issue caused some broker nodes 
to disconnect from the cluster and never fully rejoin, while at the same time 
they remained leaders for some partitions. Possibly related to 
https://issues.apache.org/jira/browse/KAFKA-7165 and 
https://issues.apache.org/jira/browse/KAFKA-6584


was (Author: rgasiorek):
Bumping this: we had a 6h MIM because of this (Kafka 2.2).

Likely root cause: a temporary latency/network issue caused some broker nodes 
to disconnect from the cluster and never fully rejoin, while at the same time 
they remained leaders for some partitions. Possibly related to 
https://issues.apache.org/jira/browse/KAFKA-7165 and 
https://issues.apache.org/jira/browse/KAFKA-6584

> Error connection between cluster node
> -
>
> Key: KAFKA-6985
> URL: https://issues.apache.org/jira/browse/KAFKA-6985
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
> Environment: Centos-7
>Reporter: Ranjeet Ranjan
>Priority: Major
>
> Hi, I have set up a multi-node Kafka cluster but am getting an error while 
> connecting one node to another, although there does not appear to be an issue 
> with the firewall or port; I am able to telnet. 
> WARN [ReplicaFetcherThread-0-1], Error in fetch 
> Kafka.server.ReplicaFetcherThread$FetchRequest@8395951 
> (Kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to Kafka-1:9092 (id: 1 rack: null) failed
>  
> {code:java}
>  
> at 
> kafka.utils.NetworkClientBlockingOps$.awaitReady$1(NetworkClientBlockingOps.scala:84)
> at 
> kafka.utils.NetworkClientBlockingOps$.blockingReady$extension(NetworkClientBlockingOps.scala:94)
> at 
> kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:244)
> at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
> at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
> at 
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> {code}
> Here you go server.properties
> Node:1
>  
> {code:java}
> # Server Basics #
> # The id of the broker. This must be set to a unique integer for each broker.
> broker.id=1
> # Switch to enable topic deletion or not, default value is false
> delete.topic.enable=true
> # Socket Server Settings 
> #
> listeners=PLAINTEXT://kafka-1:9092
> advertised.listeners=PLAINTEXT://kafka-1:9092
> #listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
> # The number of threads handling network requests
> num.network.threads=3
> # The number of threads doing disk I/O
> num.io.threads=8
> # The send buffer (SO_SNDBUF) used by the socket server
> socket.send.buffer.bytes=102400
> # The receive buffer (SO_RCVBUF) used by the socket server
> socket.receive.buffer.bytes=102400
> # The maximum size of a request that the socket server will accept 
> (protection against OOM)
> socket.request.max.bytes=104857600
> # Log Basics #
> # A comma seperated list of directories under which to store log files
> log.dirs=/var/log/kafka
> # The default number of log partitions per topic. More partitions allow 
> greater
> # parallelism for consumption, but this will also result in more files across
> # the brokers.
> num.partitions=1
> # The number of threads per data directory to be used for log recovery at 
> startup and flushing at shutdown.
> # This value is recommended to be increased for installations with data dirs 
> located in RAID array.
> num.recovery.threads.per.data.dir=1
> # Log Retention Policy 
> #
> # The minimum age of a log file to be eligible for deletion due to age
> log.retention.hours=48
> # A size-based retention policy for logs. Segments are pruned from the log as 
> long as the remaining
> # segments don't drop below log.retention.bytes. Functions independently of 
> log.retention.hours.
> log.retention.bytes=1073741824
> # The maximum size of a log segment file. When this size is reached a new log 
> segment will be created.
> log.segment.bytes=1073741824
> # The interval at which log segments are checked to see if they can be 
> deleted according
> # to the retention policies
> log.retention.check.interval.ms=30
> 

[jira] [Commented] (KAFKA-6985) Error connection between cluster node

2019-12-20 Thread Radoslaw Gasiorek (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000779#comment-17000779
 ] 

Radoslaw Gasiorek commented on KAFKA-6985:
--

Bumping this: we had a 6h MIM because of this (Kafka 2.2).

Likely root cause: a temporary latency/network issue caused some broker nodes 
to disconnect from the cluster and never fully rejoin, while at the same time 
they remained leaders for some partitions. Possibly related to 
https://issues.apache.org/jira/browse/KAFKA-7165 and 
https://issues.apache.org/jira/browse/KAFKA-6584

> Error connection between cluster node
> -
>
> Key: KAFKA-6985
> URL: https://issues.apache.org/jira/browse/KAFKA-6985
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
> Environment: Centos-7
>Reporter: Ranjeet Ranjan
>Priority: Major
>
> Hi, I have set up a multi-node Kafka cluster but am getting an error while 
> connecting one node to another, although there does not appear to be an issue 
> with the firewall or port; I am able to telnet. 
> WARN [ReplicaFetcherThread-0-1], Error in fetch 
> Kafka.server.ReplicaFetcherThread$FetchRequest@8395951 
> (Kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to Kafka-1:9092 (id: 1 rack: null) failed
>  
> {code:java}
>  
> at 
> kafka.utils.NetworkClientBlockingOps$.awaitReady$1(NetworkClientBlockingOps.scala:84)
> at 
> kafka.utils.NetworkClientBlockingOps$.blockingReady$extension(NetworkClientBlockingOps.scala:94)
> at 
> kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:244)
> at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
> at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
> at 
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> {code}
> Here you go server.properties
> Node:1
>  
> {code:java}
> # Server Basics #
> # The id of the broker. This must be set to a unique integer for each broker.
> broker.id=1
> # Switch to enable topic deletion or not, default value is false
> delete.topic.enable=true
> # Socket Server Settings 
> #
> listeners=PLAINTEXT://kafka-1:9092
> advertised.listeners=PLAINTEXT://kafka-1:9092
> #listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
> # The number of threads handling network requests
> num.network.threads=3
> # The number of threads doing disk I/O
> num.io.threads=8
> # The send buffer (SO_SNDBUF) used by the socket server
> socket.send.buffer.bytes=102400
> # The receive buffer (SO_RCVBUF) used by the socket server
> socket.receive.buffer.bytes=102400
> # The maximum size of a request that the socket server will accept 
> (protection against OOM)
> socket.request.max.bytes=104857600
> # Log Basics #
> # A comma seperated list of directories under which to store log files
> log.dirs=/var/log/kafka
> # The default number of log partitions per topic. More partitions allow 
> greater
> # parallelism for consumption, but this will also result in more files across
> # the brokers.
> num.partitions=1
> # The number of threads per data directory to be used for log recovery at 
> startup and flushing at shutdown.
> # This value is recommended to be increased for installations with data dirs 
> located in RAID array.
> num.recovery.threads.per.data.dir=1
> # Log Retention Policy 
> #
> # The minimum age of a log file to be eligible for deletion due to age
> log.retention.hours=48
> # A size-based retention policy for logs. Segments are pruned from the log as 
> long as the remaining
> # segments don't drop below log.retention.bytes. Functions independently of 
> log.retention.hours.
> log.retention.bytes=1073741824
> # The maximum size of a log segment file. When this size is reached a new log 
> segment will be created.
> log.segment.bytes=1073741824
> # The interval at which log segments are checked to see if they can be 
> deleted according
> # to the retention policies
> log.retention.check.interval.ms=30
> # Zookeeper #
> # root directory for all kafka znodes.
> zookeeper.connect=10.130.82.28:2181
> # Timeout in ms for connecting to zookeeper
> zookeeper.connection.timeout.ms=6000
> {code}
>  
>  
> Node-2
> {code:java}
> # Server Basics #
> # The id of the broker. This must be set to a unique integer for each broker.
> broker.id=2
> # Switch to 

[jira] [Resolved] (KAFKA-9239) Extreme amounts of logging done by unauthorized Kafka clients

2019-12-20 Thread Anders Eknert (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anders Eknert resolved KAFKA-9239.
--
Resolution: Fixed

> Extreme amounts of logging done by unauthorized Kafka clients
> -
>
> Key: KAFKA-9239
> URL: https://issues.apache.org/jira/browse/KAFKA-9239
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Anders Eknert
>Priority: Major
> Attachments: Screenshot 2019-11-27 at 11.32.38.png
>
>
> Having experimented some with custom authorization options for Kafka on the 
> broker side, we have a bunch of clients that are no longer authorized. While 
> that's expected and fine, we did not anticipate the level of logging that 
> these unauthorized clients would spew out - putting our whole logging 
> subsystem under heavy stress.
> The message log is similar to the one below:
> {code:java}
> 2019-11-25 10:08:10.262  WARN 1 --- [ntainer#0-0-C-1] 
> o.a.k.c.consumer.internals.Fetcher   : [Consumer clientId=sdp-ee-miami-0, 
> groupId=sdp-ee-miami] Not authorized to read from topic sdp.ee-miami.
> {code}
> In just 4 hours this same message was repeated about a hundred million times 
> ( ! ) in the worst offending client, 74 million times in the next one and 72 
> million times in the third.
> We will roll out customized burst filters to suppress this on the client 
> loggers, but it would of course be best if this was fixed in the client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9239) Extreme amounts of logging done by unauthorized Kafka clients

2019-12-20 Thread Anders Eknert (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000733#comment-17000733
 ] 

Anders Eknert commented on KAFKA-9239:
--

This turned out to be a bug in the Spring Kafka client library which has been 
[fixed|https://github.com/spring-projects/spring-kafka/pull/1337]. Closing as 
resolved.

> Extreme amounts of logging done by unauthorized Kafka clients
> -
>
> Key: KAFKA-9239
> URL: https://issues.apache.org/jira/browse/KAFKA-9239
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Anders Eknert
>Priority: Major
> Attachments: Screenshot 2019-11-27 at 11.32.38.png
>
>
> Having experimented some with custom authorization options for Kafka on the 
> broker side, we have a bunch of clients that are no longer authorized. While 
> that's expected and fine, we did not anticipate the level of logging that 
> these unauthorized clients would spew out - putting our whole logging 
> subsystem under heavy stress.
> The message log is similar to the one below:
> {code:java}
> 2019-11-25 10:08:10.262  WARN 1 --- [ntainer#0-0-C-1] 
> o.a.k.c.consumer.internals.Fetcher   : [Consumer clientId=sdp-ee-miami-0, 
> groupId=sdp-ee-miami] Not authorized to read from topic sdp.ee-miami.
> {code}
> In just 4 hours this same message was repeated about a hundred million times 
> ( ! ) in the worst offending client, 74 million times in the next one and 72 
> million times in the third.
> We will roll out customized burst filters to suppress this on the client 
> loggers, but it would of course be best if this was fixed in the client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)