[jira] [Commented] (KAFKA-9535) Metadata not updated when consumer encounters FENCED_LEADER_EPOCH

2020-02-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034172#comment-17034172
 ] 

ASF GitHub Bot commented on KAFKA-9535:
---

abbccdda commented on pull request #8088: KAFKA-9535: Update metadata upon 
receiving FENCED_LEADER_EPOCH in ListOffset
URL: https://github.com/apache/kafka/pull/8088
 
 
   Today, if we attempt to list offsets with a fenced leader epoch, the consumer 
will retry indefinitely without updating its metadata. The fix is to trigger a 
metadata update when we see `FENCED_LEADER_EPOCH`, even if it is only a partial failure.
   `UNKNOWN_LEADER_EPOCH`, on the other hand, indicates metadata staleness on the 
broker side, so the consumer does not have to update its metadata.
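   A minimal sketch of that error handling (the `handleListOffsetError` method and the 
`metadata` field are illustrative stand-ins, not the actual Fetcher code; the `Errors` 
constants are the real ones):
{code:java}
import org.apache.kafka.clients.Metadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.protocol.Errors;

class ListOffsetErrorHandling {
    private final Metadata metadata;

    ListOffsetErrorHandling(Metadata metadata) {
        this.metadata = metadata;
    }

    void handleListOffsetError(TopicPartition partition, Errors error) {
        if (error == Errors.FENCED_LEADER_EPOCH) {
            // Our cached leader epoch is stale: request a metadata refresh
            // before retrying, otherwise the retry loop never makes progress.
            metadata.requestUpdate();
        } else if (error == Errors.UNKNOWN_LEADER_EPOCH) {
            // The broker's metadata is behind the client's; a plain retry is
            // enough, and no client-side metadata update is needed.
        }
    }
}
{code}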
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Metadata not updated when consumer encounters FENCED_LEADER_EPOCH
> -
>
> Key: KAFKA-9535
> URL: https://issues.apache.org/jira/browse/KAFKA-9535
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0, 2.4.0, 2.5.0
>Reporter: Boyang Chen
>Priority: Major
>
> Inside the consumer Fetcher's handling of ListOffsetResponse, if we hit 
> `FENCED_LEADER_EPOCH` on partition level, the client will blindly retry 
> without refreshing the metadata, creating a stuck state as the local leader 
> epoch never gets updated and constantly fails the broker check.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9525) Allow explicit rebalance triggering on the Consumer

2020-02-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034116#comment-17034116
 ] 

ASF GitHub Bot commented on KAFKA-9525:
---

ableegoldman commented on pull request #8087: KAFKA-9525: add enforceRebalance 
method to Consumer API
URL: https://github.com/apache/kafka/pull/8087
 
 
   As described in 
[KIP-568](https://cwiki.apache.org/confluence/display/KAFKA/KIP-568%3A+Explicit+rebalance+triggering+on+the+Consumer).
   
   Waiting on acceptance of the KIP to write the tests, on the off chance 
something changes. But rest assured, unit tests are coming ⚡️ 
   
   Will also kick off the existing Streams system tests that leverage this new API 
(e.g. version probing, sometimes broker bounce).
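   For context, a rough sketch of how the proposed API might be used once the KIP 
is accepted (the method name and signature are still subject to change, and the 
topic name is a placeholder):
{code:java}
import java.util.Collections;
import org.apache.kafka.clients.consumer.Consumer;

class ExplicitRebalanceSketch {
    static void requestRebalance(Consumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("input-topic"));
        // ... later, when the application decides the group should recompute its
        // assignment (e.g. Streams version probing), without unsubscribing or
        // revoking all currently owned partitions:
        consumer.enforceRebalance();
    }
}
{code}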
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow explicit rebalance triggering on the Consumer
> ---
>
> Key: KAFKA-9525
> URL: https://issues.apache.org/jira/browse/KAFKA-9525
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Sophie Blee-Goldman
>Assignee: Sophie Blee-Goldman
>Priority: Major
>  Labels: needs-kip
>
> Currently the only way to explicitly trigger a rebalance is by unsubscribing 
> the consumer. This has two drawbacks: it does not work with static 
> membership, and it causes the consumer to revoke all its currently owned 
> partitions. Streams relies on being able to enforce a rebalance for its 
> version probing upgrade protocol and the upcoming KIP-441, both of which 
> should be able to work with static membership and be able to leverage the 
> improvements of KIP-429 to no longer revoke all owned partitions.
> We should add an API that will allow users to explicitly trigger a rebalance 
> without going through #unsubscribe



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-9047) AdminClient group operations may not respect backoff

2020-02-10 Thread Sanjana Kaundinya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanjana Kaundinya reassigned KAFKA-9047:


Assignee: Sanjana Kaundinya  (was: Vikas Singh)

> AdminClient group operations may not respect backoff
> 
>
> Key: KAFKA-9047
> URL: https://issues.apache.org/jira/browse/KAFKA-9047
> Project: Kafka
>  Issue Type: Bug
>  Components: admin
>Reporter: Jason Gustafson
>Assignee: Sanjana Kaundinya
>Priority: Major
>
> The retry logic for consumer group operations in the admin client is 
> complicated by the need to find the coordinator. Instead of simple retry 
> loops which send the same request over and over, we can get more complex 
> retry loops like the following:
>  # Send FindCoordinator to B -> Coordinator is A
>  # Send DescribeGroup to A -> NOT_COORDINATOR
>  # Go back to 1
> Currently we construct a new Call object for each step in this loop, which 
> means we lose some of the retry bookkeeping, such as the last retry time and 
> the number of tries. This means it is possible to have tight retry loops which 
> bounce between steps 1 and 2 and do not respect the retry backoff. 
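Illustratively, keeping the bookkeeping across both steps looks roughly like the sketch 
below; `findCoordinator` and `describeGroup` are hypothetical stand-ins for the two 
request round trips, not the real KafkaAdminClient internals:
{code:java}
import org.apache.kafka.common.Node;
import org.apache.kafka.common.errors.NotCoordinatorException;

abstract class GroupOperationRetryExample {
    // Hypothetical helpers standing in for the FindCoordinator and
    // DescribeGroups round trips.
    abstract Node findCoordinator(String groupId);
    abstract String describeGroup(Node coordinator, String groupId);

    String describeWithBackoff(String groupId, int maxTries, long retryBackoffMs)
            throws InterruptedException {
        int tries = 0;
        long lastAttemptMs = 0L;
        while (tries < maxTries) {
            long sinceLast = System.currentTimeMillis() - lastAttemptMs;
            if (sinceLast < retryBackoffMs) {
                Thread.sleep(retryBackoffMs - sinceLast); // honour the backoff
            }
            lastAttemptMs = System.currentTimeMillis();
            tries++;
            Node coordinator = findCoordinator(groupId);     // step 1
            try {
                return describeGroup(coordinator, groupId);  // step 2
            } catch (NotCoordinatorException e) {
                // Coordinator moved: go back to step 1, but keep 'tries' and
                // 'lastAttemptMs' so the loop cannot spin without backing off.
            }
        }
        throw new IllegalStateException("Exhausted " + maxTries + " tries for group " + groupId);
    }
}
{code}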



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-8403) Suppress needs a Materialized variant

2020-02-10 Thread Dongjin Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjin Lee reassigned KAFKA-8403:
--

Assignee: Dongjin Lee

> Suppress needs a Materialized variant
> -
>
> Key: KAFKA-8403
> URL: https://issues.apache.org/jira/browse/KAFKA-8403
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: John Roesler
>Assignee: Dongjin Lee
>Priority: Major
>  Labels: needs-kip
>
> WIP KIP: 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-508%3A+Make+Suppression+State+Queriable]
> The newly added KTable Suppress operator lacks a Materialized variant, which 
> would be useful if you wanted to query the results of the suppression.
> Suppression results will eventually match the upstream results, but the 
> intermediate distinction may be meaningful for some applications. For 
> example, you might want to query only the final results of a windowed 
> aggregation.
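A small sketch of that distinction using the existing 2.4 Streams API (topic and store 
names are made up); note that a `Materialized` overload for `suppress()` itself is only 
proposed in KIP-508:
{code:java}
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

class SuppressExample {
    static KTable<Windowed<String>, Long> finalCountsPerWindow(StreamsBuilder builder) {
        return builder.<String, String>stream("events")
                .groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
                // This store is queryable, but it emits intermediate window results.
                .count(Materialized.as("intermediate-counts"))
                // Today suppress() takes no Materialized, so this final-results view
                // cannot be named and queried directly; KIP-508 proposes an overload
                // that would accept a Materialized argument.
                .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
    }
}
{code}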



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9536) Integration tests for KIP-558

2020-02-10 Thread Konstantine Karantasis (Jira)
Konstantine Karantasis created KAFKA-9536:
-

 Summary: Integration tests for KIP-558
 Key: KAFKA-9536
 URL: https://issues.apache.org/jira/browse/KAFKA-9536
 Project: Kafka
  Issue Type: Test
Reporter: Konstantine Karantasis
Assignee: Konstantine Karantasis


Extend testing coverage for 
[KIP-558|https://cwiki.apache.org/confluence/display/KAFKA/KIP-558%3A+Track+the+set+of+actively+used+topics+by+connectors+in+Kafka+Connect]
 with integration tests and additional unit tests. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9440) Add ConsumerGroupCommand to delete static members

2020-02-10 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034056#comment-17034056
 ] 

Jason Gustafson commented on KAFKA-9440:


[~xuel1] Feel free to pick this up if you have time. Here is the link to KIP 
guidelines: 
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals.

> Add ConsumerGroupCommand to delete static members
> -
>
> Key: KAFKA-9440
> URL: https://issues.apache.org/jira/browse/KAFKA-9440
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Boyang Chen
>Priority: Major
>  Labels: help-wanted, kip, newbie, newbie++
>
> We introduced a new AdminClient API removeMembersFromConsumerGroup in 2.4. It 
> would be good to instantiate the API as part of the ConsumerGroupCommand for 
> easy command line usage. 
> This change requires a new KIP; posting it here in case anyone who uses static 
> membership would like to pick it up.
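For reference, a minimal sketch of calling the existing 2.4 admin API directly (the 
bootstrap server, group id, and group.instance.id are placeholders); the KIP would 
expose the same operation through the ConsumerGroupCommand command line:
{code:java}
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.MemberToRemove;
import org.apache.kafka.clients.admin.RemoveMembersFromConsumerGroupOptions;

public class RemoveStaticMemberExample {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Remove the static member identified by its group.instance.id.
            admin.removeMembersFromConsumerGroup(
                    "my-group",
                    new RemoveMembersFromConsumerGroupOptions(
                            Collections.singleton(new MemberToRemove("instance-1"))))
                 .all()
                 .get();
        }
    }
}
{code}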



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9181) Flaky test kafka.api.SaslGssapiSslEndToEndAuthorizationTest.testNoConsumeWithoutDescribeAclViaSubscribe

2020-02-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034043#comment-17034043
 ] 

ASF GitHub Bot commented on KAFKA-9181:
---

soondenana commented on pull request #8084: KAFKA-9181; Maintain clean 
separation between local and group subscriptions in consumer's 
SubscriptionState (#7941)
URL: https://github.com/apache/kafka/pull/8084
 
 
   
   Reviewers: Jason Gustafson, Guozhang Wang
   (cherry picked from commit a565d1a182cc69c9994c4512b5e9877e97f06cdf)
   
   *More detailed description of your change,
   if necessary. The PR title and PR message become
   the squashed commit message, so use a separate
   comment to ping reviewers.*
   
   *Summary of testing strategy (including rationale)
   for the feature or bug fix. Unit and/or integration
   tests are expected for any behaviour change and
   system tests should be considered for larger changes.*
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Flaky test 
> kafka.api.SaslGssapiSslEndToEndAuthorizationTest.testNoConsumeWithoutDescribeAclViaSubscribe
> ---
>
> Key: KAFKA-9181
> URL: https://issues.apache.org/jira/browse/KAFKA-9181
> Project: Kafka
>  Issue Type: Test
>  Components: core
>Reporter: Bill Bejeck
>Assignee: Rajini Sivaram
>Priority: Major
>  Labels: flaky-test, tests
> Fix For: 2.5.0
>
>
> Failed in 
> [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/26571/testReport/junit/kafka.api/SaslGssapiSslEndToEndAuthorizationTest/testNoConsumeWithoutDescribeAclViaSubscribe/]
>  
> {noformat}
> Error Message: org.apache.kafka.common.errors.TopicAuthorizationException: Not 
> authorized to access topics: [topic2]
> Stacktrace: org.apache.kafka.common.errors.TopicAuthorizationException: Not 
> authorized to access topics: [topic2]
> Standard Output: Adding ACLs for resource 
> `ResourcePattern(resourceType=CLUSTER, name=kafka-cluster, 
> patternType=LITERAL)`: 
>   (principal=User:kafka, host=*, operation=CLUSTER_ACTION, 
> permissionType=ALLOW) 
> Current ACLs for resource `Cluster:LITERAL:kafka-cluster`: 
>   User:kafka has Allow permission for operations: ClusterAction from 
> hosts: * 
> Adding ACLs for resource `ResourcePattern(resourceType=TOPIC, name=*, 
> patternType=LITERAL)`: 
>   (principal=User:kafka, host=*, operation=READ, permissionType=ALLOW) 
> Current ACLs for resource `Topic:LITERAL:*`: 
>   User:kafka has Allow permission for operations: Read from hosts: * 
> Debug is  true storeKey true useTicketCache false useKeyTab true doNotPrompt 
> false ticketCache is null isInitiator true KeyTab is 
> /tmp/kafka6494439724844851846.tmp refreshKrb5Config is false principal is 
> kafka/localh...@example.com tryFirstPass is false useFirstPass is false 
> storePass is false clearPass is false
> principal is kafka/localh...@example.com
> Will use keytab
> Commit Succeeded 
> [2019-11-13 04:43:16,187] ERROR [ReplicaFetcher replicaId=1, leaderId=0, 
> fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-11-13 04:43:16,191] ERROR [ReplicaFetcher replicaId=2, leaderId=0, 
> fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-11-13 04:43:16,384] ERROR [ReplicaFetcher replicaId=1, leaderId=2, 
> fetcherId=0] Error for partition e2etopic-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-11-13 04:43:16,384] ERROR [ReplicaFetcher replicaId=0, leaderId=2, 
> fetcherId=0] Error for partition e2etopic-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> Adding ACLs for resource `ResourcePattern(resourceType=TOPIC, name=e2etopic, 
> patternType=LITERAL)`: 
>   (principal=User:client, host=*, operation=WRITE, 

[jira] [Updated] (KAFKA-9535) Metadata not updated when consumer encounters FENCED_LEADER_EPOCH

2020-02-10 Thread Boyang Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Chen updated KAFKA-9535:
---
Summary: Metadata not updated when consumer encounters FENCED_LEADER_EPOCH  
(was: Metadata not updated when consumer encounters leader epoch related 
failures)

> Metadata not updated when consumer encounters FENCED_LEADER_EPOCH
> -
>
> Key: KAFKA-9535
> URL: https://issues.apache.org/jira/browse/KAFKA-9535
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0, 2.4.0, 2.5.0
>Reporter: Boyang Chen
>Priority: Major
>
> Inside the consumer Fetcher's handling of ListOffsetResponse, if we hit 
> `FENCED_LEADER_EPOCH` on partition level, the client will blindly retry 
> without refreshing the metadata, creating a stuck state as the local leader 
> epoch never gets updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9535) Metadata not updated when consumer encounters FENCED_LEADER_EPOCH

2020-02-10 Thread Boyang Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Chen updated KAFKA-9535:
---
Description: Inside the consumer Fetcher's handling of ListOffsetResponse, 
if we hit `FENCED_LEADER_EPOCH` on partition level, the client will blindly 
retry without refreshing the metadata, creating a stuck state as the local 
leader epoch never gets updated and constantly fails the broker check.  (was: 
Inside the consumer Fetcher's handling of ListOffsetResponse, if we hit 
`FENCED_LEADER_EPOCH` on partition level, the client will blindly retry without 
refreshing the metadata, creating a stuck state as the local leader epoch never 
gets updated.)

> Metadata not updated when consumer encounters FENCED_LEADER_EPOCH
> -
>
> Key: KAFKA-9535
> URL: https://issues.apache.org/jira/browse/KAFKA-9535
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0, 2.4.0, 2.5.0
>Reporter: Boyang Chen
>Priority: Major
>
> Inside the consumer Fetcher's handling of ListOffsetResponse, if we hit 
> `FENCED_LEADER_EPOCH` on partition level, the client will blindly retry 
> without refreshing the metadata, creating a stuck state as the local leader 
> epoch never gets updated and constantly fails the broker check.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9535) Metadata not updated when consumer encounters leader epoch related failures

2020-02-10 Thread Boyang Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Chen updated KAFKA-9535:
---
Description: Inside the consumer Fetcher's handling of ListOffsetResponse, 
if we hit `FENCED_LEADER_EPOCH` on partition level, the client will blindly 
retry without refreshing the metadata, creating a stuck state as the local 
leader epoch never gets updated.

> Metadata not updated when consumer encounters leader epoch related failures
> ---
>
> Key: KAFKA-9535
> URL: https://issues.apache.org/jira/browse/KAFKA-9535
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0, 2.4.0, 2.5.0
>Reporter: Boyang Chen
>Priority: Major
>
> Inside the consumer Fetcher's handling of ListOffsetResponse, if we hit 
> `FENCED_LEADER_EPOCH` on partition level, the client will blindly retry 
> without refreshing the metadata, creating a stuck state as the local leader 
> epoch never gets updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9535) Metadata not updated when consumer encounters leader epoch related failures

2020-02-10 Thread Boyang Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Chen updated KAFKA-9535:
---
Affects Version/s: 2.5.0
   2.3.0
   2.4.0

> Metadata not updated when consumer encounters leader epoch related failures
> ---
>
> Key: KAFKA-9535
> URL: https://issues.apache.org/jira/browse/KAFKA-9535
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0, 2.4.0, 2.5.0
>Reporter: Boyang Chen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9374) Worker can be disabled by blocked connectors

2020-02-10 Thread Chris Egerton (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034006#comment-17034006
 ] 

Chris Egerton commented on KAFKA-9374:
--

Hi [~tombentley], sorry for the delay in reply.

 

After some thought, it seems that setting a timeout on interactions with 
connectors but keeping those actions synchronous within the herder's tick 
method isn't really a viable approach. There doesn't seem to be a good value to 
use for that timeout; if it's too conservative it may be impossible to start 
some connectors that have to do heavy-duty initialization on startup, and if 
it's too liberal there will still be the original problem (for however long the 
timeout is) of the worker being effectively disabled during that period, and 
potentially even dropping out of the group.

Instead, in [https://github.com/apache/kafka/pull/8069], I've made changes to 
make most connector interactions (specifically, calls to the {{start}}, {{stop}}, 
{{config}}, {{validate}}, and {{initialize}} methods) completely asynchronous 
and to handle any follow-up logic via callback. In the {{DistributedHerder}} 
class, this callback adds a new herder request to the queue, which helps keep 
the class thread-safe and preserves some of the guarantees provided by the 
current {{tick}} model.

Unfortunately, this means that status tracking for connectors becomes... 
difficult. If we don't establish a timeout for any of our connector 
interactions, we also don't have a good way of knowing if/when to update 
the status of a connector to {{FAILED}}. At this point, the best we may be able 
to do is include log messages detailing when certain connector interactions are 
scheduled, and when those interactions are complete. That should at least 
provide a decent method for diagnosing via log files whether a connector is 
blocking and effectively a zombie. In the future, a KIP may be warranted for 
adding a new metric to track the number and types of zombie connectors/tasks.

This also still leaves the door open for zombie thread creation; any connector 
that blocks in any of the aforementioned methods will still be taking up a 
thread until/unless it returns control to the framework.
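For illustration, the timeout-based alternative discussed above could look roughly like 
the sketch below; the class and its method are hypothetical, not the actual herder code:
{code:java}
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.kafka.connect.connector.Connector;

class ConnectorStartWithTimeout {
    private final ExecutorService executor = Executors.newCachedThreadPool();

    // Run a potentially blocking connector callback on a separate thread and give
    // up (leaving a zombie thread behind) if it does not return within the timeout.
    boolean startConnector(Connector connector, Map<String, String> config, long timeoutMs) {
        Future<?> start = executor.submit(() -> connector.start(config));
        try {
            start.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;                      // started in time
        } catch (TimeoutException e) {
            // The thread keeps running; the connector would be transitioned to
            // FAILED rather than blocking the herder tick indefinitely.
            return false;
        } catch (Exception e) {
            return false;                     // start() itself threw
        }
    }
}
{code}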

> Worker can be disabled by blocked connectors
> 
>
> Key: KAFKA-9374
> URL: https://issues.apache.org/jira/browse/KAFKA-9374
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 2.0.0, 2.0.1, 2.1.0, 
> 2.2.0, 2.1.1, 2.3.0, 2.2.1, 2.2.2, 2.4.0, 2.3.1
>Reporter: Chris Egerton
>Assignee: Chris Egerton
>Priority: Major
>
> If a connector hangs during any of its {{initialize}}, {{start}}, {{stop}}, 
> {{taskConfigs}}, {{taskClass}}, {{version}}, {{config}}, or {{validate}} 
> methods, the worker will be disabled for some types of requests thereafter, 
> including connector creation, connector reconfiguration, and connector 
> deletion.
>  -This only occurs in distributed mode and is due to the threading model used 
> by the 
> [DistributedHerder|https://github.com/apache/kafka/blob/03f763df8a8d9482d8c099806336f00cf2521465/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java]
>  class.- This affects both distributed and standalone mode. Distributed 
> herders perform some connector work synchronously in their {{tick}} thread, 
> which also handles group membership and some REST requests. The majority of 
> the herder methods for the standalone herder are {{synchronized}}, including 
> those for creating, updating, and deleting connectors; as long as one of 
> those methods blocks, all subsequent calls to any of these methods will also 
> be blocked.
>  
> One potential solution could be to treat connectors that fail to start, stop, 
> etc. in time similarly to tasks that fail to stop within the [task graceful 
> shutdown timeout 
> period|https://github.com/apache/kafka/blob/03f763df8a8d9482d8c099806336f00cf2521465/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerConfig.java#L121-L126]
>  by handling all connector interactions on a separate thread, waiting for 
> them to complete within a timeout, and abandoning the thread (and 
> transitioning the connector to the {{FAILED}} state, if it has been created 
> at all) if that timeout expires.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9535) Metadata not updated when consumer encounters leader epoch related failures

2020-02-10 Thread Boyang Chen (Jira)
Boyang Chen created KAFKA-9535:
--

 Summary: Metadata not updated when consumer encounters leader 
epoch related failures
 Key: KAFKA-9535
 URL: https://issues.apache.org/jira/browse/KAFKA-9535
 Project: Kafka
  Issue Type: Bug
Reporter: Boyang Chen






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7061) Enhanced log compaction

2020-02-10 Thread Senthilnathan Muthusamy (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034004#comment-17034004
 ] 

Senthilnathan Muthusamy commented on KAFKA-7061:


[~guozhang] [~junrao] [~mjsax] Looking to get this moving; we have been waiting a 
long time on the code review, so could you please help move it along? The 12th is 
the code freeze date, right? Do you think we can't make it for the 2.5 release?

> Enhanced log compaction
> ---
>
> Key: KAFKA-7061
> URL: https://issues.apache.org/jira/browse/KAFKA-7061
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 2.5.0
>Reporter: Luis Cabral
>Assignee: Senthilnathan Muthusamy
>Priority: Major
>  Labels: kip
>
> Enhance log compaction to support more than just offset comparison, so the 
> insertion order isn't dictating which records to keep.
> Default behavior is kept as it was, with the enhanced approach having to be 
> purposely activated.
>  The enhanced compaction is done either via the record timestamp, by setting 
> the new configuration to "timestamp", or via the record headers by setting 
> this configuration to anything other than the default "offset" or the 
> reserved "timestamp".
> See 
> [KIP-280|https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction]
>  for more details.
> +From Guozhang:+ We should emphasize on the WIKI that the newly introduced 
> config yields to the existing "log.cleanup.policy", i.e. if the latter's 
> value is `delete` not `compact`, then the previous config would be ignored.
> +From Jun Rao:+ With the timestamp/header strategy, the behavior of the 
> application may need to change. In particular, the application can't just 
> blindly take the record with a larger offset and assume that it's the value 
> to keep. It needs to check the timestamp or the header now. So, it would be 
> useful to at least document this. 
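A small, generic sketch of the consumer-side caveat Jun Rao raises: with timestamp-based 
compaction the application has to compare timestamps rather than rely on consumption 
order (the class and method names here are made up):
{code:java}
import org.apache.kafka.clients.consumer.ConsumerRecord;

class LatestByTimestamp {
    // With timestamp-based compaction, the surviving record for a key is the one
    // with the greatest timestamp, not necessarily the greatest offset, so a
    // consumer keeping a local "latest value" view must compare timestamps
    // instead of simply overwriting with each newly consumed record.
    static <K, V> ConsumerRecord<K, V> keepLatest(ConsumerRecord<K, V> current,
                                                  ConsumerRecord<K, V> incoming) {
        if (current == null || incoming.timestamp() >= current.timestamp()) {
            return incoming;
        }
        return current;
    }
}
{code}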



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9534) Topics could not be deleted when there is a concurrent create topic request loop

2020-02-10 Thread Boyang Chen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033997#comment-17033997
 ] 

Boyang Chen commented on KAFKA-9534:


The reproduction is not very consistent; it will take some time to debug further.

> Topics could not be deleted when there is a concurrent create topic request 
> loop
> 
>
> Key: KAFKA-9534
> URL: https://issues.apache.org/jira/browse/KAFKA-9534
> Project: Kafka
>  Issue Type: Bug
>  Components: admin
>Affects Versions: 2.3.0, 2.5.0
>Reporter: Boyang Chen
>Priority: Major
>
> The reproduce steps:
>  # start local ZK
>  # start local broker
>  #  Run the following script which keeps creating an input topic until 
> success:
>  
> {code:java}
> package kafka.examples;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.NewTopic;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.common.errors.TopicExistsException;
> import java.util.Arrays;
> import java.util.List;
> import java.util.Properties;
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.ExecutionException;
> public class Reproduce {
> public static void main(String[] args) throws ExecutionException, 
> InterruptedException {
> Properties props = new Properties();
> props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
> KafkaProperties.KAFKA_SERVER_URL + ":" + 
> KafkaProperties.KAFKA_SERVER_PORT);
> Admin adminClient = Admin.create(props);
> createTopics(adminClient);
> CountDownLatch createSucceed = new CountDownLatch(1);
> Thread deleteTopicThread = new Thread(() -> {
> List topicsToDelete = Arrays.asList("input-topic", 
> "output-topic");
> while (true) {
> try {
> Thread.sleep(1000);
> adminClient.deleteTopics(topicsToDelete).all().get();
> if (createSucceed.getCount() == 0) {
> break;
> }
> } catch (ExecutionException | InterruptedException e) {
> System.out.println("Encountered exception during topic 
> deletion: " + e.getCause());
> }
> }
> System.out.println("Deleted old topics: " + topicsToDelete);
>  });
> deleteTopicThread.start();
> while (true) {
> try {
> createTopics(adminClient);
> System.out.println("Created new topic!");
> break;
> } catch (ExecutionException | InterruptedException e) {
> if (!(e.getCause() instanceof TopicExistsException)) {
> throw e;
> }
> System.out.println("Metadata of the old topics are not 
> cleared yet... " + e.getMessage());
> Thread.sleep(1000);
> }
> }
> createSucceed.countDown();
> deleteTopicThread.join();
> }
> private static void createTopics(Admin adminClient) throws 
> InterruptedException, ExecutionException {
> adminClient.createTopics(Arrays.asList(
> new NewTopic("input-topic", 1, (short) 1),
> new NewTopic("output-topic", 1, (short) 1))).all().get();
> }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9534) Topics could not be deleted when there is a concurrent create topic request loop

2020-02-10 Thread Boyang Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Chen updated KAFKA-9534:
---
Description: 
The reproduce steps:
 # start local ZK
 # start local broker
 #  Run the following script which keeps creating an input topic until success:


 
{code:java}
package kafka.examples;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.errors.TopicExistsException;

import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;

public class Reproduce {

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                KafkaProperties.KAFKA_SERVER_URL + ":" + KafkaProperties.KAFKA_SERVER_PORT);

        Admin adminClient = Admin.create(props);

        createTopics(adminClient);
        CountDownLatch createSucceed = new CountDownLatch(1);
        // Background thread that keeps deleting the topics until the re-creation succeeds.
        Thread deleteTopicThread = new Thread(() -> {
            List<String> topicsToDelete = Arrays.asList("input-topic", "output-topic");
            while (true) {
                try {
                    Thread.sleep(1000);
                    adminClient.deleteTopics(topicsToDelete).all().get();
                    if (createSucceed.getCount() == 0) {
                        break;
                    }
                } catch (ExecutionException | InterruptedException e) {
                    System.out.println("Encountered exception during topic deletion: " + e.getCause());
                }
            }
            System.out.println("Deleted old topics: " + topicsToDelete);
        });
        deleteTopicThread.start();

        // Keep trying to re-create the topics until the old metadata is gone.
        while (true) {
            try {
                createTopics(adminClient);
                System.out.println("Created new topic!");
                break;
            } catch (ExecutionException | InterruptedException e) {
                if (!(e.getCause() instanceof TopicExistsException)) {
                    throw e;
                }
                System.out.println("Metadata of the old topics are not cleared yet... " + e.getMessage());
                Thread.sleep(1000);
            }
        }
        createSucceed.countDown();
        deleteTopicThread.join();
    }

    private static void createTopics(Admin adminClient) throws InterruptedException, ExecutionException {
        adminClient.createTopics(Arrays.asList(
                new NewTopic("input-topic", 1, (short) 1),
                new NewTopic("output-topic", 1, (short) 1))).all().get();
    }
}

{code}

  was:Not sure this is indeed a bug; just filing it to keep track of it.


> Topics could not be deleted when there is a concurrent create topic request 
> loop
> 
>
> Key: KAFKA-9534
> URL: https://issues.apache.org/jira/browse/KAFKA-9534
> Project: Kafka
>  Issue Type: Bug
>  Components: admin
>Affects Versions: 2.3.0, 2.5.0
>Reporter: Boyang Chen
>Priority: Major
>
> The reproduce steps:
>  # start local ZK
>  # start local broker
>  #  Run the following script which keeps creating an input topic until 
> success:
>  
> {code:java}
> package kafka.examples;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.NewTopic;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.common.errors.TopicExistsException;
> import java.util.Arrays;
> import java.util.List;
> import java.util.Properties;
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.ExecutionException;
> public class Reproduce {
> public static void main(String[] args) throws ExecutionException, 
> InterruptedException {
> Properties props = new Properties();
> props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
> KafkaProperties.KAFKA_SERVER_URL + ":" + 
> KafkaProperties.KAFKA_SERVER_PORT);
> Admin adminClient = Admin.create(props);
> createTopics(adminClient);
> CountDownLatch createSucceed = new CountDownLatch(1);
> Thread deleteTopicThread = new Thread(() -> {
> List topicsToDelete = Arrays.asList("input-topic", 
> "output-topic");
> while (true) {
> try {
> Thread.sleep(1000);
> adminClient.deleteTopics(topicsToDelete).all().get();
> if (createSucceed.getCount() == 0) {
> break;
> }
> } catch (ExecutionException | InterruptedException e) {
> 

[jira] [Updated] (KAFKA-9534) Topics could not be deleted when there is a concurrent create topic request loop

2020-02-10 Thread Boyang Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Chen updated KAFKA-9534:
---
Affects Version/s: 2.3.0

> Topics could not be deleted when there is a concurrent create topic request 
> loop
> 
>
> Key: KAFKA-9534
> URL: https://issues.apache.org/jira/browse/KAFKA-9534
> Project: Kafka
>  Issue Type: Bug
>  Components: admin
>Affects Versions: 2.3.0, 2.5.0
>Reporter: Boyang Chen
>Priority: Major
>
> Not sure this is indeed a bug; just filing it to keep track of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9534) Topics could not be deleted when there is a concurrent create topic request loop

2020-02-10 Thread Boyang Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Chen updated KAFKA-9534:
---
Summary: Topics could not be deleted when there is a concurrent create 
topic request loop  (was: Potential inconsistent result for listTopic talking 
to older brokers)

> Topics could not be deleted when there is a concurrent create topic request 
> loop
> 
>
> Key: KAFKA-9534
> URL: https://issues.apache.org/jira/browse/KAFKA-9534
> Project: Kafka
>  Issue Type: Bug
>  Components: admin
>Affects Versions: 2.5.0
>Reporter: Boyang Chen
>Priority: Major
>
> Not sure this is indeed a bug; just filing it to keep track of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9517) KTable Joins Without Materialized Argument Yield Results That Further Joins NPE On

2020-02-10 Thread John Roesler (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033985#comment-17033985
 ] 

John Roesler commented on KAFKA-9517:
-

Quick update, I've merged this fix to trunk, 2.5, and 2.4. I'll go ahead and 
mark this ticket resolved so that it doesn't prevent the creation of 2.4.1 or 
2.5.0 release candidates.

I'd still very much appreciate it if you can test it to make sure it resolves 
the issue for your use case. 

> KTable Joins Without Materialized Argument Yield Results That Further Joins 
> NPE On
> --
>
> Key: KAFKA-9517
> URL: https://issues.apache.org/jira/browse/KAFKA-9517
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.4.0
>Reporter: Paul Snively
>Assignee: John Roesler
>Priority: Blocker
> Fix For: 2.5.0, 2.4.1
>
> Attachments: test.tar.xz
>
>
> The `KTable` API implemented 
> [here|https://github.com/apache/kafka/blob/2.4.0/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableImpl.java#L842-L844]
>  calls `doJoinOnForeignKey` with an argument of `Materialized.with(null, null)`, 
> as apparently do several other APIs. As the comment spanning [these 
> lines|#L1098-L1099] makes clear, the result is a `KTable` whose `valueSerde` 
> (as a `KTableImpl`) is `null`. Therefore, attempts to `join` etc. on the 
> resulting `KTable` fail with a `NullPointerException`.
> While there is an obvious workaround—explicitly construct the required 
> `Materialized` and use the APIs that take it as an argument—I have to admit I 
> find the existence of public APIs with this sort of bug, particularly when 
> the bug is literally documented as a comment in the source code, astonishing 
> to the point of incredulity. It calls the quality and trustworthiness of 
> Kafka Streams into serious question, and if a resolution is not forthcoming 
> within a week, we will be left with no other option but to consider technical 
> alternatives.
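A minimal sketch of the workaround mentioned above, assuming String-keyed and 
String-valued tables (the table contents, foreign-key extractor, and joiner are 
placeholders): pass a `Materialized` with explicit serdes so the resulting `KTable` 
has a non-null `valueSerde` and can be joined again.
{code:java}
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

class FkJoinWorkaround {
    static KTable<String, String> joinWithExplicitSerdes(KTable<String, String> orders,
                                                         KTable<String, String> customers) {
        // Explicit serdes instead of Materialized.with(null, null), so downstream
        // operations on the join result do not hit the NullPointerException.
        Materialized<String, String, KeyValueStore<Bytes, byte[]>> materialized =
                Materialized.with(Serdes.String(), Serdes.String());
        return orders.join(
                customers,
                orderValue -> orderValue,                     // foreign-key extractor (placeholder)
                (order, customer) -> order + "/" + customer,  // joiner (placeholder)
                materialized);
    }
}
{code}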



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9517) KTable Joins Without Materialized Argument Yield Results That Further Joins NPE On

2020-02-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033962#comment-17033962
 ] 

ASF GitHub Bot commented on KAFKA-9517:
---

vvcephei commented on pull request #8061: KAFKA-9517: Fix default serdes with 
FK join
URL: https://github.com/apache/kafka/pull/8061
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KTable Joins Without Materialized Argument Yield Results That Further Joins 
> NPE On
> --
>
> Key: KAFKA-9517
> URL: https://issues.apache.org/jira/browse/KAFKA-9517
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.4.0
>Reporter: Paul Snively
>Assignee: John Roesler
>Priority: Blocker
> Fix For: 2.5.0, 2.4.1
>
> Attachments: test.tar.xz
>
>
> The `KTable` API implemented 
> [here|https://github.com/apache/kafka/blob/2.4.0/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableImpl.java#L842-L844]
>  calls `doJoinOnForeignKey` with an argument of `Materialized.with(null, null)`, 
> as apparently do several other APIs. As the comment spanning [these 
> lines|#L1098-L1099] makes clear, the result is a `KTable` whose `valueSerde` 
> (as a `KTableImpl`) is `null`. Therefore, attempts to `join` etc. on the 
> resulting `KTable` fail with a `NullPointerException`.
> While there is an obvious workaround—explicitly construct the required 
> `Materialized` and use the APIs that take it as an argument—I have to admit I 
> find the existence of public APIs with this sort of bug, particularly when 
> the bug is literally documented as a comment in the source code, astonishing 
> to the point of incredulity. It calls the quality and trustworthiness of 
> Kafka Streams into serious question, and if a resolution is not forthcoming 
> within a week, we will be left with no other option but to consider technical 
> alternatives.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9517) KTable Joins Without Materialized Argument Yield Results That Further Joins NPE On

2020-02-10 Thread John Roesler (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033960#comment-17033960
 ] 

John Roesler commented on KAFKA-9517:
-

Hi Paul,

Ah, those failures shouldn't affect your testing. #8015 depends on a fix I 
added for the TopologyTestDriver (https://github.com/apache/kafka/pull/8065). 
If you want to clear the error, you can cherry-pick that one also, or you can 
just skip the tests and build the artifacts directly with `installAll`.

Sorry for neglecting to mention that initially; both changes were part of 8015 
to begin with, but the reviewers rightly suggested I should pull out the 
TopologyTestDriver fix into a separately verified PR.

Thanks,
-John

> KTable Joins Without Materialized Argument Yield Results That Further Joins 
> NPE On
> --
>
> Key: KAFKA-9517
> URL: https://issues.apache.org/jira/browse/KAFKA-9517
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.4.0
>Reporter: Paul Snively
>Assignee: John Roesler
>Priority: Blocker
> Fix For: 2.5.0, 2.4.1
>
> Attachments: test.tar.xz
>
>
> The `KTable` API implemented 
> [here|https://github.com/apache/kafka/blob/2.4.0/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableImpl.java#L842-L844]
>  calls `doJoinOnForeignKey` with an argument of `Materialized.with(null, null)`, 
> as apparently do several other APIs. As the comment spanning [these 
> lines|#L1098-L1099] makes clear, the result is a `KTable` whose `valueSerde` 
> (as a `KTableImpl`) is `null`. Therefore, attempts to `join` etc. on the 
> resulting `KTable` fail with a `NullPointerException`.
> While there is an obvious workaround—explicitly construct the required 
> `Materialized` and use the APIs that take it as an argument—I have to admit I 
> find the existence of public APIs with this sort of bug, particularly when 
> the bug is literally documented as a comment in the source code, astonishing 
> to the point of incredulity. It calls the quality and trustworthiness of 
> Kafka Streams into serious question, and if a resolution is not forthcoming 
> within a week, we will be left with no other option but to consider technical 
> alternatives.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9523) Reduce flakiness of BranchedMultiLevelRepartitionConnectedTopologyTest

2020-02-10 Thread Guozhang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang resolved KAFKA-9523.
--
Fix Version/s: 2.5.0
   Resolution: Fixed

> Reduce flakiness of BranchedMultiLevelRepartitionConnectedTopologyTest
> --
>
> Key: KAFKA-9523
> URL: https://issues.apache.org/jira/browse/KAFKA-9523
> Project: Kafka
>  Issue Type: Test
>  Components: streams, unit tests
>Reporter: Boyang Chen
>Assignee: Boyang Chen
>Priority: Major
> Fix For: 2.5.0
>
>
> KAFKA-9335 introduces an integration test to verify that the topology builder 
> itself can survive building a complex topology. This test is sometimes flaky 
> due to the stream client to broker connection, so we should consider 
> making it less flaky by either converting it to a unit test or focusing on 
> making the test logic more robust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9523) Reduce flakiness of BranchedMultiLevelRepartitionConnectedTopologyTest

2020-02-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033947#comment-17033947
 ] 

ASF GitHub Bot commented on KAFKA-9523:
---

guozhangwang commented on pull request #8081: KAFKA-9523: Migrate 
BranchedMultiLevelRepartitionConnectedTopologyTest into a unit test
URL: https://github.com/apache/kafka/pull/8081
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Reduce flakiness of BranchedMultiLevelRepartitionConnectedTopologyTest
> --
>
> Key: KAFKA-9523
> URL: https://issues.apache.org/jira/browse/KAFKA-9523
> Project: Kafka
>  Issue Type: Test
>  Components: streams, unit tests
>Reporter: Boyang Chen
>Assignee: Boyang Chen
>Priority: Major
>
> KAFKA-9335 introduces an integration test to verify that the topology builder 
> itself can survive building a complex topology. This test is sometimes flaky 
> due to the stream client to broker connection, so we should consider 
> making it less flaky by either converting it to a unit test or focusing on 
> making the test logic more robust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9480) Value for Task-level Metric process-rate is Constant Zero

2020-02-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033946#comment-17033946
 ] 

ASF GitHub Bot commented on KAFKA-9480:
---

guozhangwang commented on pull request #8018: KAFKA-9480: Fix bug that 
prevented to measure task-level process-rate
URL: https://github.com/apache/kafka/pull/8018
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Value for Task-level Metric process-rate is Constant Zero 
> --
>
> Key: KAFKA-9480
> URL: https://issues.apache.org/jira/browse/KAFKA-9480
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.4.0
>Reporter: Bruno Cadonna
>Assignee: Bruno Cadonna
>Priority: Major
> Fix For: 2.5.0
>
>
> The value for task-level metric process-rate is constant zero. The value 
> should reflect the number of calls to {{process()}}  on source processors 
> which clearly cannot be constant zero. 
> This behavior applies to built-in metrics version {{latest}}. 
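For anyone wanting to check the metric locally, a sketch of reading it through 
`KafkaStreams#metrics()`; the group name "stream-task-metrics" and the metric name 
"process-rate" are assumptions based on the built-in metrics naming and should be 
verified against the running application:
{code:java}
import java.util.Map;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

class ProcessRateProbe {
    // Prints the task-level process-rate values of a running topology.
    static void printProcessRate(KafkaStreams streams) {
        for (Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
            MetricName name = entry.getKey();
            // Group/metric names are assumptions; adjust to what your version exposes.
            if (name.group().equals("stream-task-metrics") && name.name().equals("process-rate")) {
                System.out.println(name.tags() + " -> " + entry.getValue().metricValue());
            }
        }
    }
}
{code}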



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9505) InternalTopicManager may falls into infinite loop with partially created topics

2020-02-10 Thread Guozhang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang resolved KAFKA-9505.
--
Fix Version/s: 2.5.0
   Resolution: Fixed

> InternalTopicManager may falls into infinite loop with partially created 
> topics
> ---
>
> Key: KAFKA-9505
> URL: https://issues.apache.org/jira/browse/KAFKA-9505
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Guozhang Wang
>Assignee: Guozhang Wang
>Priority: Major
> Fix For: 2.5.0
>
>
> In {{InternalTopicManager#validateTopics(topicsNotReady, topics)}}, the 
> topics map (the second argument) does not change, while the first argument, 
> topicsNotReady, may change as some topics are validated and others are not; 
> however, inside that function we still loop over the second map, so the loop 
> may never complete.
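Illustratively, the fix amounts to iterating over the shrinking not-yet-ready set rather 
than the full topic map; the sketch below is generic (the `TopicChecker` interface is 
hypothetical), not the actual InternalTopicManager code:
{code:java}
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TopicValidationLoop {
    interface TopicChecker {
        boolean isReady(String topic);
    }

    // Re-check only the topics that are still not ready; looping over the full
    // topic map would keep the loop alive even after every topic has validated.
    static void validateUntilReady(Map<String, Integer> topics, TopicChecker checker)
            throws InterruptedException {
        Set<String> notReady = new HashSet<>(topics.keySet());
        while (!notReady.isEmpty()) {
            notReady.removeIf(checker::isReady);  // drop topics that became ready
            if (!notReady.isEmpty()) {
                Thread.sleep(100L);               // back off before the next attempt
            }
        }
    }
}
{code}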



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9505) InternalTopicManager may falls into infinite loop with partially created topics

2020-02-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033942#comment-17033942
 ] 

ASF GitHub Bot commented on KAFKA-9505:
---

guozhangwang commented on pull request #8039: KAFKA-9505: Only loop over 
topics-to-validate in retries
URL: https://github.com/apache/kafka/pull/8039
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> InternalTopicManager may falls into infinite loop with partially created 
> topics
> ---
>
> Key: KAFKA-9505
> URL: https://issues.apache.org/jira/browse/KAFKA-9505
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Guozhang Wang
>Assignee: Guozhang Wang
>Priority: Major
>
> In {{InternalTopicManager#validateTopics(topicsNotReady, topics)}}, the 
> topics map (the second argument) does not change, while the first argument, 
> topicsNotReady, may change as some topics are validated and others are not; 
> however, inside that function we still loop over the second map, so the loop 
> may never complete.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-9523) Reduce flakiness of BranchedMultiLevelRepartitionConnectedTopologyTest

2020-02-10 Thread Boyang Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Chen reassigned KAFKA-9523:
--

Assignee: Boyang Chen

> Reduce flakiness of BranchedMultiLevelRepartitionConnectedTopologyTest
> --
>
> Key: KAFKA-9523
> URL: https://issues.apache.org/jira/browse/KAFKA-9523
> Project: Kafka
>  Issue Type: Test
>  Components: streams, unit tests
>Reporter: Boyang Chen
>Assignee: Boyang Chen
>Priority: Major
>
> KAFKA-9335 introduces an integration test to verify that the topology builder 
> itself can survive building a complex topology. This test is sometimes flaky 
> due to the stream client to broker connection, so we should consider 
> making it less flaky by either converting it to a unit test or focusing on 
> making the test logic more robust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9534) Potential inconsistent result for listTopic talking to older brokers

2020-02-10 Thread Boyang Chen (Jira)
Boyang Chen created KAFKA-9534:
--

 Summary: Potential inconsistent result for listTopic talking to 
older brokers
 Key: KAFKA-9534
 URL: https://issues.apache.org/jira/browse/KAFKA-9534
 Project: Kafka
  Issue Type: Bug
  Components: admin
Affects Versions: 2.5.0
Reporter: Boyang Chen


Not sure this is indeed a bug; just filing it to keep track of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9450) Decouple inner state flushing from committing

2020-02-10 Thread Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sophie Blee-Goldman updated KAFKA-9450:
---
Description: 
When EOS is turned on, the commit interval is set quite low (100ms) and all the 
store layers are flushed during a commit. This is necessary for forwarding 
records in the cache to the changelog, but unfortunately also forces rocksdb to 
flush the current memtable before it's full. The result is a large number of 
small writes to disk, losing the benefits of batching, and a large number of 
very small L0 files that are likely to slow compaction.

Since we have to delete the stores to recreate from scratch anyways during an 
unclean shutdown with EOS, we may as well skip flushing the innermost 
StateStore during a commit and only do so during a graceful shutdown, before a 
rebalance, etc. This is currently blocked on a refactoring of the state store 
layers to allow decoupling the flush of the caching layer from the actual state 
store.

Note that this is especially problematic with EOS due to the necessarily-low 
commit interval, but still hurts even with at-least-once and a much larger 
commit interval. 

  was:
When EOS is turned on, the commit interval is set quite low (100ms) and all the 
store layers are flushed during a commit. This is necessary for forwarding 
records in the cache to the changelog, but unfortunately also forces rocksdb to 
flush the current memtable before it's full. The result is a large number of 
small writes to disk, losing the benefits of batching, and a large number of 
very small L0 files that are likely to slow compaction.

Since we have to delete the stores to recreate from scratch anyways during an 
unclean shutdown with EOS, we may as well skip flushing the innermost 
StateStore during a commit and only do so during a graceful shutdown, before a 
rebalance, etc. This is currently blocked on a refactoring of the state store 
layers to allow decoupling the flush of the caching layer from the actual state 
store.


> Decouple inner state flushing from committing
> -
>
> Key: KAFKA-9450
> URL: https://issues.apache.org/jira/browse/KAFKA-9450
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Priority: Major
>
> When EOS is turned on, the commit interval is set quite low (100ms) and all 
> the store layers are flushed during a commit. This is necessary for 
> forwarding records in the cache to the changelog, but unfortunately also 
> forces rocksdb to flush the current memtable before it's full. The result is 
> a large number of small writes to disk, losing the benefits of batching, and 
> a large number of very small L0 files that are likely to slow compaction.
> Since we have to delete the stores to recreate from scratch anyways during an 
> unclean shutdown with EOS, we may as well skip flushing the innermost 
> StateStore during a commit and only do so during a graceful shutdown, before 
> a rebalance, etc. This is currently blocked on a refactoring of the state 
> store layers to allow decoupling the flush of the caching layer from the 
> actual state store.
> Note that this is especially problematic with EOS due to the necessarily-low 
> commit interval, but still hurts even with at-least-once and a much larger 
> commit interval. 
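For context, a minimal sketch of the configuration relationship described above 
(application id and bootstrap server are placeholders): under exactly-once the commit 
interval defaults to 100 ms, which is what drives the frequent flushes.
{code:java}
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

class EosCommitIntervalExample {
    static Properties eosConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "flush-example");       // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        // The default under EOS is already 100 ms; set explicitly here only for
        // illustration. Raising it trades latency for fewer store flushes, but the
        // coupling of flushing to committing is what this ticket proposes to remove.
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);
        return props;
    }
}
{code}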



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9450) Decouple inner state flushing from committing

2020-02-10 Thread Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sophie Blee-Goldman updated KAFKA-9450:
---
Summary: Decouple inner state flushing from committing  (was: Decouple 
inner state flushing from committing with EOS)

> Decouple inner state flushing from committing
> -
>
> Key: KAFKA-9450
> URL: https://issues.apache.org/jira/browse/KAFKA-9450
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Priority: Major
>
> When EOS is turned on, the commit interval is set quite low (100ms) and all 
> the store layers are flushed during a commit. This is necessary for 
> forwarding records in the cache to the changelog, but unfortunately also 
> forces rocksdb to flush the current memtable before it's full. The result is 
> a large number of small writes to disk, losing the benefits of batching, and 
> a large number of very small L0 files that are likely to slow compaction.
> Since we have to delete the stores to recreate from scratch anyways during an 
> unclean shutdown with EOS, we may as well skip flushing the innermost 
> StateStore during a commit and only do so during a graceful shutdown, before 
> a rebalance, etc. This is currently blocked on a refactoring of the state 
> store layers to allow decoupling the flush of the caching layer from the 
> actual state store.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9450) Decouple inner state flushing from committing with EOS

2020-02-10 Thread Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033915#comment-17033915
 ] 

Sophie Blee-Goldman commented on KAFKA-9450:


[~NaviBrar] I assume that was only merged into the most recent RocksDB version? 
We aren't able to bump the rocksdb dependency further until the next major 
version bump, due to some breaking changes in the options-related API. I'm not 
sure whether the RocksDB folks might be willing to cherry-pick this back to a 
5.x version and release that; if not, maybe you should make a ticket to track 
this and mark it as blocked by KAFKA-8897 so it doesn't get forgotten.

> Decouple inner state flushing from committing with EOS
> --
>
> Key: KAFKA-9450
> URL: https://issues.apache.org/jira/browse/KAFKA-9450
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Priority: Major
>
> When EOS is turned on, the commit interval is set quite low (100ms) and all 
> the store layers are flushed during a commit. This is necessary for 
> forwarding records in the cache to the changelog, but unfortunately also 
> forces rocksdb to flush the current memtable before it's full. The result is 
> a large number of small writes to disk, losing the benefits of batching, and 
> a large number of very small L0 files that are likely to slow compaction.
> Since we have to delete the stores to recreate from scratch anyways during an 
> unclean shutdown with EOS, we may as well skip flushing the innermost 
> StateStore during a commit and only do so during a graceful shutdown, before 
> a rebalance, etc. This is currently blocked on a refactoring of the state 
> store layers to allow decoupling the flush of the caching layer from the 
> actual state store.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-8897) Increase Version of RocksDB

2020-02-10 Thread Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sophie Blee-Goldman updated KAFKA-8897:
---
Fix Version/s: 3.0.0

> Increase Version of RocksDB
> ---
>
> Key: KAFKA-8897
> URL: https://issues.apache.org/jira/browse/KAFKA-8897
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: Bruno Cadonna
>Priority: Major
> Fix For: 3.0.0
>
>
> A higher version (6+) of RocksDB is needed for some metrics specified in 
> KIP-471. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-8897) Increase Version of RocksDB

2020-02-10 Thread Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sophie Blee-Goldman updated KAFKA-8897:
---
Priority: Blocker  (was: Major)

> Increase Version of RocksDB
> ---
>
> Key: KAFKA-8897
> URL: https://issues.apache.org/jira/browse/KAFKA-8897
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: Bruno Cadonna
>Priority: Blocker
> Fix For: 3.0.0
>
>
> A higher version (6+) of RocksDB is needed for some metrics specified in 
> KIP-471. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9523) Reduce flakiness of BranchedMultiLevelRepartitionConnectedTopologyTest

2020-02-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033913#comment-17033913
 ] 

ASF GitHub Bot commented on KAFKA-9523:
---

abbccdda commented on pull request #8081: KAFKA-9523: Migrate 
BranchedMultiLevelRepartitionConnectedTopologyTest into a unit test
URL: https://github.com/apache/kafka/pull/8081
 
 
   Relying on an integration test to catch an algorithm bug introduces more 
flakiness; convert the test into a unit test to reduce the flakiness until we 
upgrade the Java/Scala libs.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Reduce flakiness of BranchedMultiLevelRepartitionConnectedTopologyTest
> --
>
> Key: KAFKA-9523
> URL: https://issues.apache.org/jira/browse/KAFKA-9523
> Project: Kafka
>  Issue Type: Test
>  Components: streams, unit tests
>Reporter: Boyang Chen
>Priority: Major
>
> KAFKA-9335 introduces an integration test to verify that the topology builder 
> itself can survive building a complex topology. This test sometimes gets flaky 
> on the stream client to broker connection, so we should consider making it 
> less flaky by either converting it to a unit test or just focusing on making 
> the test logic more robust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9423) Refine layout of configuration options on website and make individual settings directly linkable

2020-02-10 Thread Mickael Maison (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mickael Maison resolved KAFKA-9423.
---
Fix Version/s: 2.6.0
   Resolution: Fixed

> Refine layout of configuration options on website and make individual 
> settings directly linkable
> 
>
> Key: KAFKA-9423
> URL: https://issues.apache.org/jira/browse/KAFKA-9423
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sönke Liebau
>Assignee: Sönke Liebau
>Priority: Trivial
> Fix For: 2.6.0
>
> Attachments: option1.png, option2.png, option3.png, option4.png
>
>
> KAFKA-8474 changed the layout of configuration options on the website from a 
> table which over time ran out of horizontal space to a list.
> This vastly improved readability but is not yet ideal. Further discussion was 
> had in the [PR|https://github.com/apache/kafka/pull/6870] for KAFKA-8474.
> This ticket is to move that discussion to a separate thread and make it more 
> visible to other people and to give subsequent PRs a home.
> Currently proposed options are attached to this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-6266) Kafka 1.0.0 : Repeated occurrence of WARN Resetting first dirty offset of __consumer_offsets-xx to log start offset 203569 since the checkpointed offset 120955 is invali

2020-02-10 Thread David Mao (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mao reassigned KAFKA-6266:


Assignee: David Mao  (was: Anna Povzner)

> Kafka 1.0.0 : Repeated occurrence of WARN Resetting first dirty offset of 
> __consumer_offsets-xx to log start offset 203569 since the checkpointed 
> offset 120955 is invalid. (kafka.log.LogCleanerManager$)
> --
>
> Key: KAFKA-6266
> URL: https://issues.apache.org/jira/browse/KAFKA-6266
> Project: Kafka
>  Issue Type: Bug
>  Components: offset manager
>Affects Versions: 1.0.0, 1.0.1
> Environment: CentOS 7, Apache kafka_2.12-1.0.0
>Reporter: VinayKumar
>Assignee: David Mao
>Priority: Major
> Fix For: 2.5.0, 2.4.1
>
>
> I upgraded Kafka from 0.10.2.1 to 1.0.0 version. From then, I see the below 
> warnings in the log.
>  I'm seeing these continuously in the log, and want these to be fixed- so 
> that they wont repeat. Can someone please help me in fixing the below 
> warnings.
> {code}
> WARN Resetting first dirty offset of __consumer_offsets-17 to log start 
> offset 3346 since the checkpointed offset 3332 is invalid. 
> (kafka.log.LogCleanerManager$)
>  WARN Resetting first dirty offset of __consumer_offsets-23 to log start 
> offset 4 since the checkpointed offset 1 is invalid. 
> (kafka.log.LogCleanerManager$)
>  WARN Resetting first dirty offset of __consumer_offsets-19 to log start 
> offset 203569 since the checkpointed offset 120955 is invalid. 
> (kafka.log.LogCleanerManager$)
>  WARN Resetting first dirty offset of __consumer_offsets-35 to log start 
> offset 16957 since the checkpointed offset 7 is invalid. 
> (kafka.log.LogCleanerManager$)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9533) ValueTransform forwards `null` values

2020-02-10 Thread Michael Viamari (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Viamari updated KAFKA-9533:
---
Description: 
According to the documentation for `KStream#transformValues`, nulls returned 
from `ValueTransformer#transform` are not forwarded. (see 
[KStream#transformValues|https://kafka.apache.org/24/javadoc/org/apache/kafka/streams/kstream/KStream.html#transformValues-org.apache.kafka.streams.kstream.ValueTransformerSupplier-java.lang.String...-]

However, this does not appear to be the case. In 
`KStreamTransformValuesProcessor#process` the result of the transform is 
forwarded directly.
{code:java}
@Override
public void process(final K key, final V value) {
    context.forward(key, valueTransformer.transform(key, value));
}
{code}

  was:
According to the documentation for `KStream#transformValues`, nulls returned 
from `ValueTransformer#transform` are not forwarded. (see 
[KStream#transformValues|https://kafka.apache.org/24/javadoc/org/apache/kafka/streams/kstream/KStream.html#transformValues-org.apache.kafka.streams.kstream.ValueTransformerSupplier-java.lang.String...-]

However, this does not appear to be the case. In 
`KStreamTransformValuesProcessor#transform` the result of the transform is 
forwarded directly.
{code:java}
@Override
public void process(final K key, final V value) {
    context.forward(key, valueTransformer.transform(key, value));
}
{code}


> ValueTransform forwards `null` values
> -
>
> Key: KAFKA-9533
> URL: https://issues.apache.org/jira/browse/KAFKA-9533
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.4.0
>Reporter: Michael Viamari
>Priority: Minor
>
> According to the documentation for `KStream#transformValues`, nulls returned 
> from `ValueTransformer#transform` are not forwarded. (see 
> [KStream#transformValues|https://kafka.apache.org/24/javadoc/org/apache/kafka/streams/kstream/KStream.html#transformValues-org.apache.kafka.streams.kstream.ValueTransformerSupplier-java.lang.String...-]
> However, this does not appear to be the case. In 
> `KStreamTransformValuesProcessor#process` the result of the transform is 
> forwarded directly.
> {code:java}
> @Override
> public void process(final K key, final V value) {
>     context.forward(key, valueTransformer.transform(key, value));
> }
> {code}
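
For illustration only, a sketch of what the documented behavior (dropping null transform results instead of forwarding them) would look like at this spot. `V1` stands in for the transformer's result type as in the snippet above; this is neither the actual Kafka Streams source nor a proposed patch.

{code:java}
@Override
public void process(final K key, final V value) {
    final V1 newValue = valueTransformer.transform(key, value);
    // Per the javadoc quoted above, a null result would not be forwarded;
    // the current code forwards it unconditionally.
    if (newValue != null) {
        context.forward(key, newValue);
    }
}
{code}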



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9533) ValueTransform forwards `null` values

2020-02-10 Thread Michael Viamari (Jira)
Michael Viamari created KAFKA-9533:
--

 Summary: ValueTransform forwards `null` values
 Key: KAFKA-9533
 URL: https://issues.apache.org/jira/browse/KAFKA-9533
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 2.4.0
Reporter: Michael Viamari


According to the documentation for `KStream#transformValues`, nulls returned 
from `ValueTransformer#transform` are not forwarded. (see 
[KStream#transformValues|https://kafka.apache.org/24/javadoc/org/apache/kafka/streams/kstream/KStream.html#transformValues-org.apache.kafka.streams.kstream.ValueTransformerSupplier-java.lang.String...-]

However, this does not appear to be the case. In 
`KStreamTransformValuesProcessor#transform` the result of the transform is 
forwarded directly.
{code:java}
@Override
public void process(final K key, final V value) {
    context.forward(key, valueTransformer.transform(key, value));
}
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9487) Followup : KAFKA-9445(Allow fetching a key from a single partition); addressing code review comments

2020-02-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033823#comment-17033823
 ] 

ASF GitHub Bot commented on KAFKA-9487:
---

vvcephei commented on pull request #8033: KAFKA-9487: Follow-up PR of Kafka-9445
URL: https://github.com/apache/kafka/pull/8033
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Followup : KAFKA-9445(Allow fetching a key from a single partition); 
> addressing code review comments
> 
>
> Key: KAFKA-9487
> URL: https://issues.apache.org/jira/browse/KAFKA-9487
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Navinder Brar
>Assignee: Navinder Brar
>Priority: Blocker
> Fix For: 2.5.0
>
>
> A few code review comments are left to be addressed from Kafka 9445, which I 
> will be addressing in this PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9423) Refine layout of configuration options on website and make individual settings directly linkable

2020-02-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033816#comment-17033816
 ] 

ASF GitHub Bot commented on KAFKA-9423:
---

mimaison commented on pull request #7955: KAFKA-9423: Refine layout of 
configuration options on website and make individual settings directly linkable
URL: https://github.com/apache/kafka/pull/7955
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refine layout of configuration options on website and make individual 
> settings directly linkable
> 
>
> Key: KAFKA-9423
> URL: https://issues.apache.org/jira/browse/KAFKA-9423
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sönke Liebau
>Assignee: Sönke Liebau
>Priority: Trivial
> Attachments: option1.png, option2.png, option3.png, option4.png
>
>
> KAFKA-8474 changed the layout of configuration options on the website from a 
> table which over time ran out of horizontal space to a list.
> This vastly improved readability but is not yet ideal. Further discussion was 
> had in the [PR|https://github.com/apache/kafka/pull/6870] for KAFKA-8474.
> This ticket is to move that discussion to a separate thread and make it more 
> visible to other people and to give subsequent PRs a home.
> Currently proposed options are attached to this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-4090) JVM runs into OOM if (Java) client uses a SSL port without setting the security protocol

2020-02-10 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033810#comment-17033810
 ] 

David Mollitor commented on KAFKA-4090:
---

https://github.com/apache/kafka/pull/8066

> JVM runs into OOM if (Java) client uses a SSL port without setting the 
> security protocol
> 
>
> Key: KAFKA-4090
> URL: https://issues.apache.org/jira/browse/KAFKA-4090
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 0.9.0.1, 0.10.0.1, 2.1.0
>Reporter: Jaikiran Pai
>Assignee: Alexandre Dupriez
>Priority: Major
>
> Quoting from the mail thread that was sent to Kafka mailing list:
> {quote}
> We have been using Kafka 0.9.0.1 (server and Java client libraries). So far 
> we had been using it with plaintext transport but recently have been 
> considering upgrading to using SSL. It mostly works except that a 
> mis-configured producer (and even consumer) causes a hard to relate 
> OutOfMemory exception and thus causing the JVM in which the client is 
> running, to go into a bad state. We can consistently reproduce that OOM very 
> easily. We decided to check if this is something that is fixed in 0.10.0.1 so 
> upgraded one of our test systems to that version (both server and client 
> libraries) but still see the same issue. Here's how it can be easily 
> reproduced
> 1. Enable SSL listener on the broker via server.properties, as per the Kafka 
> documentation
> {code}
> listeners=PLAINTEXT://:9092,SSL://:9093
> ssl.keystore.location=
> ssl.keystore.password=pass
> ssl.key.password=pass
> ssl.truststore.location=
> ssl.truststore.password=pass
> {code}
> 2. Start zookeeper and kafka server
> 3. Create a "oom-test" topic (which will be used for these tests):
> {code}
> kafka-topics.sh --zookeeper localhost:2181 --create --topic oom-test  
> --partitions 1 --replication-factor 1
> {code}
> 4. Create a simple producer which sends a single message to the topic via 
> Java (new producer) APIs:
> {code}
> public class OOMTest {
> public static void main(final String[] args) throws Exception {
> final Properties kafkaProducerConfigs = new Properties();
> // NOTE: Intentionally use a SSL port without specifying 
> security.protocol as SSL
> 
> kafkaProducerConfigs.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, 
> "localhost:9093");
> 
> kafkaProducerConfigs.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, 
> StringSerializer.class.getName());
> 
> kafkaProducerConfigs.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
>  StringSerializer.class.getName());
> try (KafkaProducer<String, String> producer = new 
> KafkaProducer<>(kafkaProducerConfigs)) {
> System.out.println("Created Kafka producer");
> final String topicName = "oom-test";
> final String message = "Hello OOM!";
> // send a message to the topic
> final Future<RecordMetadata> recordMetadataFuture = 
> producer.send(new ProducerRecord<>(topicName, message));
> final RecordMetadata sentRecordMetadata = 
> recordMetadataFuture.get();
> System.out.println("Sent message '" + message + "' to topic '" + 
> topicName + "'");
> }
> System.out.println("Tests complete");
> }
> }
> {code}
> Notice that the server URL is using a SSL endpoint localhost:9093 but isn't 
> specifying any of the other necessary SSL configs like security.protocol.
> 5. For the sake of easily reproducing this issue run this class with a max 
> heap size of 256MB (-Xmx256M). Running this code throws up the following 
> OutOfMemoryError in one of the Sender threads:
> {code}
> 18:33:25,770 ERROR [KafkaThread] - Uncaught exception in 
> kafka-producer-network-thread | producer-1:
> java.lang.OutOfMemoryError: Java heap space
> at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
> at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
> at 
> org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:93)
> at 
> org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
> at 
> org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:153)
> at 
> org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:134)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:286)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:256)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:216)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:128)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Note that I set it to 256MB as heap size to easily reproduce it but this 
> isn't specific to 
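
For contrast, a minimal sketch of the client settings the reproducer above omits (the truststore path and password are placeholders): pointing bootstrap.servers at the SSL listener only behaves as expected once security.protocol and the SSL settings are supplied as well.

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.config.SslConfigs;
import org.apache.kafka.common.serialization.StringSerializer;

public class SslProducerConfigSketch {
    public static Properties producerConfigs() {
        final Properties configs = new Properties();
        configs.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9093");
        // The piece missing from the reproducer: tell the client this port speaks SSL.
        configs.setProperty(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SSL");
        configs.setProperty(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/path/to/truststore.jks"); // placeholder
        configs.setProperty(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "pass");                    // placeholder
        configs.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        configs.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return configs;
    }
}
{code}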

[jira] [Commented] (KAFKA-9517) KTable Joins Without Materialized Argument Yield Results That Further Joins NPE On

2020-02-10 Thread Paul Snively (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033785#comment-17033785
 ] 

Paul Snively commented on KAFKA-9517:
-

Thanks for the suggestion to squash and cherry-pick #8015 and #8061. I've done 
that, and `./gradlew test` is giving me four errors that seem related to the 
cherry-picked PRs. I'm attaching the test report for others to perhaps analyze. 
My colleague and I will also attempt to reproduce the issues we specifically 
encountered in using 2.4.0. I'm reasonably confident we can also take some time 
to review these two PRs, but that seems somewhat unlikely to happen today.

[^test.tar.xz]

> KTable Joins Without Materialized Argument Yield Results That Further Joins 
> NPE On
> --
>
> Key: KAFKA-9517
> URL: https://issues.apache.org/jira/browse/KAFKA-9517
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.4.0
>Reporter: Paul Snively
>Assignee: John Roesler
>Priority: Blocker
> Fix For: 2.5.0, 2.4.1
>
> Attachments: test.tar.xz
>
>
> The `KTable` API implemented 
> [here|https://github.com/apache/kafka/blob/2.4.0/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableImpl.java#L842-L844]
> calls `doJoinOnForeignKey` with an argument of 
> `Materialized.with(null, null)`, as apparently do several other APIs. As the 
> comment spanning [these lines|#L1098-L1099] makes clear, the result is a 
> `KTable` whose `valueSerde` (as a `KTableImpl`) is `null`. Therefore, 
> attempts to `join` etc. on the resulting `KTable` fail with a 
> `NullPointerException`.
> While there is an obvious workaround—explicitly construct the required 
> `Materialized` and use the APIs that take it as an argument—I have to admit I 
> find the existence of public APIs with this sort of bug, particularly when 
> the bug is literally documented as a comment in the source code, astonishing 
> to the point of incredulity. It calls the quality and trustworthiness of 
> Kafka Streams into serious question, and if a resolution is not forthcoming 
> within a week, we will be left with no other option but to consider technical 
> alternatives.
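
To make the workaround mentioned above concrete, here is a minimal sketch (topic names, the extractor, and the joiner are illustrative; default serdes are assumed for the input tables) of a foreign-key join that passes an explicit `Materialized` instead of relying on the overload that defaults to `Materialized.with(null, null)`:

{code:java}
import java.util.function.Function;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.ValueJoiner;

public class ForeignKeyJoinWorkaroundSketch {
    public static void main(final String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();
        // Hypothetical topics: "orders" is keyed by orderId with the customerId as its value,
        // "customers" is keyed by customerId with the customer name as its value.
        final KTable<String, String> orders = builder.table("orders");
        final KTable<String, String> customers = builder.table("customers");

        // In this toy example the order value itself is the customerId.
        final Function<String, String> customerIdExtractor = orderValue -> orderValue;
        final ValueJoiner<String, String, String> joiner =
                (customerId, customerName) -> customerId + ":" + customerName;

        // Passing an explicit Materialized with serdes side-steps the null valueSerde
        // described above.
        final KTable<String, String> enriched = orders.join(
                customers, customerIdExtractor, joiner,
                Materialized.with(Serdes.String(), Serdes.String()));

        builder.build();
    }
}
{code}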



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9517) KTable Joins Without Materialized Argument Yield Results That Further Joins NPE On

2020-02-10 Thread Paul Snively (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Snively updated KAFKA-9517:

Attachment: test.tar.xz

> KTable Joins Without Materialized Argument Yield Results That Further Joins 
> NPE On
> --
>
> Key: KAFKA-9517
> URL: https://issues.apache.org/jira/browse/KAFKA-9517
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.4.0
>Reporter: Paul Snively
>Assignee: John Roesler
>Priority: Blocker
> Fix For: 2.5.0, 2.4.1
>
> Attachments: test.tar.xz
>
>
> The `KTable` API implemented 
> [here|https://github.com/apache/kafka/blob/2.4.0/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableImpl.java#L842-L844]
> calls `doJoinOnForeignKey` with an argument of 
> `Materialized.with(null, null)`, as apparently do several other APIs. As the 
> comment spanning [these lines|#L1098-L1099] makes clear, the result is a 
> `KTable` whose `valueSerde` (as a `KTableImpl`) is `null`. Therefore, 
> attempts to `join` etc. on the resulting `KTable` fail with a 
> `NullPointerException`.
> While there is an obvious workaround—explicitly construct the required 
> `Materialized` and use the APIs that take it as an argument—I have to admit I 
> find the existence of public APIs with this sort of bug, particularly when 
> the bug is literally documented as a comment in the source code, astonishing 
> to the point of incredulity. It calls the quality and trustworthiness of 
> Kafka Streams into serious question, and if a resolution is not forthcoming 
> within a week, we will be left with no other option but to consider technical 
> alternatives.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-8623) KafkaProducer possible deadlock when sending to different topics

2020-02-10 Thread Brian Byrne (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Byrne resolved KAFKA-8623.

Fix Version/s: 2.3.0
   Resolution: Fixed

This appears to be due to an issue in the clients' handling of consecutive 
metadata updates, where the first update could effectively clear the request for 
the second because no version/instance of the outstanding request was 
maintained. This was fixed in PR 
[6221|https://github.com/apache/kafka/pull/6221] (see item 3), which is 
available in the 2.3.0 release.

> KafkaProducer possible deadlock when sending to different topics
> 
>
> Key: KAFKA-8623
> URL: https://issues.apache.org/jira/browse/KAFKA-8623
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 2.2.1
>Reporter: Alexander Bagiev
>Assignee: Kun Song
>Priority: Critical
> Fix For: 2.3.0
>
>
> Project with bug reproduction: [https://github.com/abagiev/kafka-producer-bug]
> It was found that sending two messages to two different topics in a row 
> causes the KafkaProducer to hang for 60s with the following exception:
> {noformat}
> org.springframework.kafka.core.KafkaProducerException: Failed to send; nested 
> exception is org.apache.kafka.common.errors.TimeoutException: Failed to 
> update metadata after 60000 ms.
>   at 
> org.springframework.kafka.core.KafkaTemplate.lambda$buildCallback$0(KafkaTemplate.java:405)
>  ~[spring-kafka-2.2.7.RELEASE.jar:2.2.7.RELEASE]
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:877)
>  ~[kafka-clients-2.0.1.jar:na]
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:803) 
> ~[kafka-clients-2.0.1.jar:na]
>   at 
> org.springframework.kafka.core.DefaultKafkaProducerFactory$CloseSafeProducer.send(DefaultKafkaProducerFactory.java:444)
>  ~[spring-kafka-2.2.7.RELEASE.jar:2.2.7.RELEASE]
>   at 
> org.springframework.kafka.core.KafkaTemplate.doSend(KafkaTemplate.java:381) 
> ~[spring-kafka-2.2.7.RELEASE.jar:2.2.7.RELEASE]
>   at 
> org.springframework.kafka.core.KafkaTemplate.send(KafkaTemplate.java:193) 
> ~[spring-kafka-2.2.7.RELEASE.jar:2.2.7.RELEASE]
> ...
> {noformat}
> It looks like KafkaProducer requests metadata twice for each topic and hangs 
> just before the second request due to some deadlock. Once 60s pass, a 
> TimeoutException is thrown and the metadata is requested/received immediately 
> (but after the exception has already been thrown).
> The issue in the example project is reproduced every time; and the use case 
> is trivial.
>  This is a critical bug for us.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9515) Upgrade ZooKeeper to 3.5.7

2020-02-10 Thread Ron Dagostino (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033759#comment-17033759
 ] 

Ron Dagostino commented on KAFKA-9515:
--

We should probably also expand the system tests to include the case where ZK 
TLS is enabled but clientAuth=none.

> Upgrade ZooKeeper to 3.5.7
> --
>
> Key: KAFKA-9515
> URL: https://issues.apache.org/jira/browse/KAFKA-9515
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Blocker
> Fix For: 2.5.0, 2.4.1
>
>
> There are some critical fixes in ZK 3.5.7 and the first RC has been posted:
> [https://mail-archives.apache.org/mod_mbox/zookeeper-dev/202002.mbox/%3cCAGH6_KiULzemT-V4x_2ybWeKLMvQ+eh=q-dzsiz8a-ypp5t...@mail.gmail.com%3e]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9515) Upgrade ZooKeeper to 3.5.7

2020-02-10 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033753#comment-17033753
 ] 

Ismael Juma commented on KAFKA-9515:


[~rndgstn] I actually think 2.5 should ship with ZK 3.5.7. Does that mean we 
have to do extra work?

> Upgrade ZooKeeper to 3.5.7
> --
>
> Key: KAFKA-9515
> URL: https://issues.apache.org/jira/browse/KAFKA-9515
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Blocker
> Fix For: 2.5.0, 2.4.1
>
>
> There are some critical fixes in ZK 3.5.7 and the first RC has been posted:
> [https://mail-archives.apache.org/mod_mbox/zookeeper-dev/202002.mbox/%3cCAGH6_KiULzemT-V4x_2ybWeKLMvQ+eh=q-dzsiz8a-ypp5t...@mail.gmail.com%3e]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-9515) Upgrade ZooKeeper to 3.5.7

2020-02-10 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033753#comment-17033753
 ] 

Ismael Juma edited comment on KAFKA-9515 at 2/10/20 4:58 PM:
-

[~rndgstn] I actually think AK 2.5 should ship with ZK 3.5.7. Does that mean we 
have to do extra work?


was (Author: ijuma):
[~rndgstn] I actually think 2.5 should ship with ZK 3.5.7. Does that mean we 
have to do extra work?

> Upgrade ZooKeeper to 3.5.7
> --
>
> Key: KAFKA-9515
> URL: https://issues.apache.org/jira/browse/KAFKA-9515
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Blocker
> Fix For: 2.5.0, 2.4.1
>
>
> There are some critical fixes in ZK 3.5.7 and the first RC has been posted:
> [https://mail-archives.apache.org/mod_mbox/zookeeper-dev/202002.mbox/%3cCAGH6_KiULzemT-V4x_2ybWeKLMvQ+eh=q-dzsiz8a-ypp5t...@mail.gmail.com%3e]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8374) KafkaApis.handleLeaderAndIsrRequest not robust to ZooKeeper exceptions

2020-02-10 Thread Jun Rao (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033734#comment-17033734
 ] 

Jun Rao commented on KAFKA-8374:


This could be related to https://issues.apache.org/jira/browse/KAFKA-9307.

> KafkaApis.handleLeaderAndIsrRequest not robust to ZooKeeper exceptions
> --
>
> Key: KAFKA-8374
> URL: https://issues.apache.org/jira/browse/KAFKA-8374
> Project: Kafka
>  Issue Type: Bug
>  Components: core, offset manager
>Affects Versions: 2.0.1
> Environment: Linux x86_64 (Ubuntu Xenial) running on AWS EC2
>Reporter: Mike Mintz
>Assignee: Bob Barrett
>Priority: Major
>
> h2. Summary of bug (theory)
> During a leader election, when a broker is transitioning from leader to 
> follower on some __consumer_offset partitions and some __transaction_state 
> partitions, it’s possible for a ZooKeeper exception to be thrown, leaving the 
> GroupMetadataManager in an inconsistent state.
>  
> In particular, in KafkaApis.handleLeaderAndIsrRequest in the 
> onLeadershipChange callback, it’s possible for 
> TransactionCoordinator.handleTxnEmigration to throw 
> ZooKeeperClientExpiredException, ending the updatedFollowers.foreach loop 
> early. If there were any __consumer_offset partitions to be handled later in 
> the loop, GroupMetadataManager will be left with stale data in its 
> groupMetadataCache. Later, when this broker resumes leadership for the 
> affected __consumer_offset partitions, it will fail to load the updated 
> groups into the cache since it uses putIfNotExists, and it will serve stale 
> offsets to consumers.
>  
> h2. Details of what we experienced
> We ran into this issue running Kafka 2.0.1 in production. Several Kafka 
> consumers received stale offsets when reconnecting to their group coordinator 
> after a leadership election on their __consumer_offsets partition. This 
> caused them to reprocess many duplicate messages.
>  
> We believe we’ve tracked down the root cause:
>  * On 2019-04-01, we were having memory pressure in ZooKeeper, and we were 
> getting several ZooKeeperClientExpiredException errors in the logs.
>  * The impacted consumers were all in __consumer_offsets-15. There was a 
> leader election on this partition, and leadership transitioned from broker 
> 1088 to broker 1069. During this leadership election, the former leader 
> (1088) saw a ZooKeeperClientExpiredException  (stack trace below). This 
> happened inside KafkaApis.handleLeaderAndIsrRequest, specifically in 
> onLeadershipChange while it was updating a __transaction_state partition. 
> Since there are no “Scheduling unloading” or “Finished unloading” log 
> messages in this period, we believe it threw this exception before getting to 
> __consumer_offsets-15, so it never got a chance to call 
> GroupCoordinator.handleGroupEmigration, which means this broker didn’t unload 
> offsets from its GroupMetadataManager.
>  * Four days later, on 2019-04-05, we manually restarted broker 1069, so 
> broker 1088 became the leader of __consumer_offsets-15 again. When it ran 
> GroupMetadataManager.loadGroup, it presumably failed to update 
> groupMetadataCache since it uses putIfNotExists, and it would have found the 
> group id already in the cache. Unfortunately we did not have debug logging 
> enabled, but I would expect to have seen a log message like "Attempt to load 
> group ${group.groupId} from log with generation ${group.generationId} failed 
> because there is already a cached group with generation 
> ${currentGroup.generationId}".
>  * After the leadership election, the impacted consumers reconnected to 
> broker 1088 and received stale offsets that correspond to the last committed 
> offsets around 2019-04-01.
>  
> h2. Relevant log/stacktrace
> {code:java}
> [2019-04-01 22:44:18.968617] [2019-04-01 22:44:18,963] ERROR [KafkaApi-1088] 
> Error when handling request 
> {controller_id=1096,controller_epoch=122,partition_states=[...,{topic=__consumer_offsets,partition=15,controller_epoch=122,leader=1069,leader_epoch=440,isr=[1092,1088,1069],zk_version=807,replicas=[1069,1088,1092],is_new=false},...],live_leaders=[{id=1069,host=10.68.42.121,port=9094}]}
>  (kafka.server.KafkaApis)
> [2019-04-01 22:44:18.968689] kafka.zookeeper.ZooKeeperClientExpiredException: 
> Session expired either before or while waiting for connection
> [2019-04-01 22:44:18.968712]         at 
> kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:238)
> [2019-04-01 22:44:18.968736]         at 
> kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:226)
> [2019-04-01 22:44:18.968759]         at 
> 

[jira] [Commented] (KAFKA-9515) Upgrade ZooKeeper to 3.5.7

2020-02-10 Thread Ron Dagostino (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033729#comment-17033729
 ] 

Ron Dagostino commented on KAFKA-9515:
--

ZooKeeper 3.5.7 also adds support for the "ssl.clientAuth=[want|need|none]" 
configuration on the ZooKeeper server side.  This means with v3.5.7 client 
certificates become optional (they are required in 3.5.6, which is what shipped 
with AK 2.4 and what will ship with AK 2.5).  As per [this GitHub PR 
conversation for KIP 
515|https://github.com/apache/kafka/pull/8003#discussion_r376476887] (text 
adjusted a bit now that we have more info):

"We need to decide in 3 places (KafkaServer, ConfigCommand, and 
ZkSecurityMigrator) whether or not the ZooKeeper client should generate ACLs in 
ZooKeeper when creating znodes. Prior to the possibility of x509 authentication 
it was easy to decide: was SASL enabled to ZooKeeper or not. Now it is 
supported for SASL to not be enabled but x509 auth to be enabled -- and in that 
case we want to generate ACLs. So in the 3 cases we have to look for this 
possibility. I agree it is entirely possible that ZooKeeper might not 
authenticate the client -- technically in ZK 3.5.6 it is not possible to turn 
that off, but it will be possible in ZK 3.5.7 and beyond. So while with 
ZooKeeper 3.5.6 it isn't an issue, at some point in the future it will be. It 
is possible that ZK might ignore the client certificate, we might generate 
ACLs, and those ACLs might grant access to World. One idea to avoid this is to 
make the connection with ACLs enabled, create a random temporary znode, read 
the ACLs, and check if they are world-enabled; then abort at that point if they are. 
It would probably be a good idea to add this when we upgrade to ZooKeeper 
3.5.7."
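
A rough sketch of the safety check described in that last sentence, using the plain ZooKeeper client (the znode path and the world-open test are illustrative; this is the idea, not the eventual Kafka implementation):

{code:java}
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Stat;

public class ZkAclProbeSketch {
    /**
     * Creates a throwaway znode with the ACLs we intend to generate, reads back what the
     * server actually stored, and reports whether those ACLs grant everything to World.
     */
    public static boolean generatedAclsAreWorldOpen(final ZooKeeper zk,
                                                    final List<ACL> intendedAcls) throws Exception {
        final String path = zk.create("/kafka-acl-probe", new byte[0],
                intendedAcls, CreateMode.EPHEMERAL_SEQUENTIAL);
        try {
            final List<ACL> stored = zk.getACL(path, new Stat());
            return stored.stream().anyMatch(acl ->
                    "world".equals(acl.getId().getScheme())
                            && (acl.getPerms() & ZooDefs.Perms.ALL) == ZooDefs.Perms.ALL);
        } finally {
            zk.delete(path, -1); // clean up the probe znode
        }
    }
}
{code}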


> Upgrade ZooKeeper to 3.5.7
> --
>
> Key: KAFKA-9515
> URL: https://issues.apache.org/jira/browse/KAFKA-9515
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Blocker
> Fix For: 2.5.0, 2.4.1
>
>
> There are some critical fixes in ZK 3.5.7 and the first RC has been posted:
> [https://mail-archives.apache.org/mod_mbox/zookeeper-dev/202002.mbox/%3cCAGH6_KiULzemT-V4x_2ybWeKLMvQ+eh=q-dzsiz8a-ypp5t...@mail.gmail.com%3e]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9469) Add zookeeper.ssl.context.supplier.class config if/when adopting ZooKeeper 3.6

2020-02-10 Thread Ron Dagostino (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033727#comment-17033727
 ] 

Ron Dagostino commented on KAFKA-9469:
--

Just checked and "zookeeper.ssl.context.supplier.class" did not make it into 
ZooKeeper 3.5.7.  So we must still wait for v3.6.

> Add zookeeper.ssl.context.supplier.class config if/when adopting ZooKeeper 3.6
> --
>
> Key: KAFKA-9469
> URL: https://issues.apache.org/jira/browse/KAFKA-9469
> Project: Kafka
>  Issue Type: New Feature
>  Components: config
>Reporter: Ron Dagostino
>Assignee: Ron Dagostino
>Priority: Minor
>
> The "zookeeper.ssl.context.supplier.class" configuration doesn't actually 
> exist in ZooKeeper 3.5.6.  The ZooKeeper admin guide documents it as being 
> there, but it doesn't appear in the code.  This means we can't support it in 
> KIP-515, and it has been removed from that KIP.
> I checked the latest ZooKeeper 3.6 SNAPSHOT, and it has been added.  So this 
> config could probably be added to Kafka via a new, small KIP if/when we 
> upgrade to ZooKeeper 3.6 (which looks to be in release-candidate stage at the 
> moment).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-9531) java.net.UnknownHostException loop on VM rolling update using CNAME

2020-02-10 Thread Rui Abreu (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033650#comment-17033650
 ] 

Rui Abreu edited comment on KAFKA-9531 at 2/10/20 3:13 PM:
---

Seems related to the following issues:

 

https://issues.apache.org/jira/browse/KAFKA-7890

https://issues.apache.org/jira/browse/KAFKA-8057

https://issues.apache.org/jira/browse/KAFKA-7755


was (Author: rabreu):
Seems related to the following issues:

 

https://issues.apache.org/jira/browse/KAFKA-7890

https://issues.apache.org/jira/browse/KAFKA-8057

> java.net.UnknownHostException loop on VM rolling update using CNAME
> ---
>
> Key: KAFKA-9531
> URL: https://issues.apache.org/jira/browse/KAFKA-9531
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, controller, network, producer 
>Affects Versions: 2.4.0
>Reporter: Rui Abreu
>Priority: Major
>
> Hello,
>  
> My cluster setup is based on VMs behind a DNS CNAME.
> Example:  node.internal is a CNAME to either nodeA.internal or nodeB.internal
> Since kafka-client 1.2.1,  it has been observed that sometimes Kafka clients 
> get stuck in a loop with the exception:
> Example after nodeB.internal is replaced with nodeA.internal 
>  
> {code:java}
> 2020-02-10T12:11:28.181Z o.a.k.c.NetworkClient [WARN]- [Consumer 
> clientId=consumer-6, groupId=consumer.group] Error connecting to node 
> nodeB.internal:9092 (id: 2 rack: null)
> java.net.UnknownHostException: nodeB.internal:9092
>   at java.net.InetAddress.getAllByName0(InetAddress.java:1281) 
> ~[?:1.8.0_222]
>   at java.net.InetAddress.getAllByName(InetAddress.java:1193) 
> ~[?:1.8.0_222]
>   at java.net.InetAddress.getAllByName(InetAddress.java:1127) 
> ~[?:1.8.0_222]
>   at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:104) 
> ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.currentAddress(ClusterConnectionStates.java:403)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:943)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.NetworkClient.access$600(NetworkClient.java:68) 
> ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1114)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1005)
>  ~[stormjar.jar:?]
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:537) 
> ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:161)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:366)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1251)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1220) 
> ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1159) 
> ~[stormjar.jar:?]
>   at 
> org.apache.storm.kafka.spout.KafkaSpout.pollKafkaBroker(KafkaSpout.java:365) 
> ~[stormjar.jar:?]
>   at 
> org.apache.storm.kafka.spout.KafkaSpout.nextTuple(KafkaSpout.java:294) 
> ~[stormjar.jar:?]
>   at 
> org.apache.storm.daemon.executor$fn__10715$fn__10730$fn__10761.invoke(executor.clj:649)
>  ~[storm-core-1.1.3.jar:1.1.3]
>   at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:484) 
> ~[storm-core-1.1.3.jar:1.1.3]
>   at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.7.0.jar:?]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
> {code}
>  
> The time it spends in the loop is arbitrary, but it seems the client 
> effectively stops while this is happening.
> This error contrasts with instances where the client is able to recover on 
> its own after a few seconds:
> {code:java}
> 

[jira] [Assigned] (KAFKA-8623) KafkaProducer possible deadlock when sending to different topics

2020-02-10 Thread Kun Song (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kun Song reassigned KAFKA-8623:
---

Assignee: Kun Song

> KafkaProducer possible deadlock when sending to different topics
> 
>
> Key: KAFKA-8623
> URL: https://issues.apache.org/jira/browse/KAFKA-8623
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 2.2.1
>Reporter: Alexander Bagiev
>Assignee: Kun Song
>Priority: Critical
>
> Project with bug reproduction: [https://github.com/abagiev/kafka-producer-bug]
> It was found that sending two messages to two different topics in a row 
> causes the KafkaProducer to hang for 60s with the following exception:
> {noformat}
> org.springframework.kafka.core.KafkaProducerException: Failed to send; nested 
> exception is org.apache.kafka.common.errors.TimeoutException: Failed to 
> update metadata after 60000 ms.
>   at 
> org.springframework.kafka.core.KafkaTemplate.lambda$buildCallback$0(KafkaTemplate.java:405)
>  ~[spring-kafka-2.2.7.RELEASE.jar:2.2.7.RELEASE]
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:877)
>  ~[kafka-clients-2.0.1.jar:na]
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:803) 
> ~[kafka-clients-2.0.1.jar:na]
>   at 
> org.springframework.kafka.core.DefaultKafkaProducerFactory$CloseSafeProducer.send(DefaultKafkaProducerFactory.java:444)
>  ~[spring-kafka-2.2.7.RELEASE.jar:2.2.7.RELEASE]
>   at 
> org.springframework.kafka.core.KafkaTemplate.doSend(KafkaTemplate.java:381) 
> ~[spring-kafka-2.2.7.RELEASE.jar:2.2.7.RELEASE]
>   at 
> org.springframework.kafka.core.KafkaTemplate.send(KafkaTemplate.java:193) 
> ~[spring-kafka-2.2.7.RELEASE.jar:2.2.7.RELEASE]
> ...
> {noformat}
> It looks like KafkaProducer requests metadata twice for each topic and hangs 
> just before the second request due to some deadlock. Once 60s pass, a 
> TimeoutException is thrown and the metadata is requested/received immediately 
> (but after the exception has already been thrown).
> The issue in the example project is reproduced every time; and the use case 
> is trivial.
>  This is a critical bug for us.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9531) java.net.UnknownHostException loop on VM rolling update using CNAME

2020-02-10 Thread Rui Abreu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Abreu updated KAFKA-9531:
-
Component/s: network

> java.net.UnknownHostException loop on VM rolling update using CNAME
> ---
>
> Key: KAFKA-9531
> URL: https://issues.apache.org/jira/browse/KAFKA-9531
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, controller, network, producer 
>Affects Versions: 2.4.0
>Reporter: Rui Abreu
>Priority: Major
>
> Hello,
>  
> My cluster setup is based on VMs behind a DNS CNAME.
> Example:  node.internal is a CNAME to either nodeA.internal or nodeB.internal
> Since kafka-client 1.2.1,  it has been observed that sometimes Kafka clients 
> get stuck in a loop with the exception:
> Example after nodeB.internal is replaced with nodeA.internal 
>  
> {code:java}
> 2020-02-10T12:11:28.181Z o.a.k.c.NetworkClient [WARN]- [Consumer 
> clientId=consumer-6, groupId=consumer.group] Error connecting to node 
> nodeB.internal:9092 (id: 2 rack: null)
> java.net.UnknownHostException: nodeB.internal:9092
>   at java.net.InetAddress.getAllByName0(InetAddress.java:1281) 
> ~[?:1.8.0_222]
>   at java.net.InetAddress.getAllByName(InetAddress.java:1193) 
> ~[?:1.8.0_222]
>   at java.net.InetAddress.getAllByName(InetAddress.java:1127) 
> ~[?:1.8.0_222]
>   at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:104) 
> ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.currentAddress(ClusterConnectionStates.java:403)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:943)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.NetworkClient.access$600(NetworkClient.java:68) 
> ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1114)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1005)
>  ~[stormjar.jar:?]
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:537) 
> ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:161)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:366)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1251)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1220) 
> ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1159) 
> ~[stormjar.jar:?]
>   at 
> org.apache.storm.kafka.spout.KafkaSpout.pollKafkaBroker(KafkaSpout.java:365) 
> ~[stormjar.jar:?]
>   at 
> org.apache.storm.kafka.spout.KafkaSpout.nextTuple(KafkaSpout.java:294) 
> ~[stormjar.jar:?]
>   at 
> org.apache.storm.daemon.executor$fn__10715$fn__10730$fn__10761.invoke(executor.clj:649)
>  ~[storm-core-1.1.3.jar:1.1.3]
>   at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:484) 
> ~[storm-core-1.1.3.jar:1.1.3]
>   at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.7.0.jar:?]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
> {code}
>  
> The time it spends in the loop is arbitrary, but it seems the client 
> effectively stops while this is happening.
> This error contrasts with instances where the client is able to recover on 
> its own after a few seconds:
> {code:java}
> 2020-02-08T01:15:37.390Z o.a.k.c.c.i.AbstractCoordinator [INFO] - [Consumer 
> clientId=consumer-7, groupId=consumer-group] Group coordinator 
> nodeA.internal:9092 (id: 2147483645 rack: null) is unavailable or invalid, 
> will attempt rediscovery
>  
> 2020-02-08T01:15:37.885Z o.a.k.c.c.i.AbstractCoordinator [INFO] - [Consumer 
> clientId=consumer-7, groupId=consumer-group] Discovered group coordinator 
> nodeB.internal:9092 (id: 

[jira] [Commented] (KAFKA-9504) Memory leak in KafkaMetrics registered to MBean

2020-02-10 Thread Murilo Tavares (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033663#comment-17033663
 ] 

Murilo Tavares commented on KAFKA-9504:
---

I just ran the above against version 2.3.1, and it looks fine on that release. 
So apparently this was really introduced in 2.4.0.

> Memory leak in KafkaMetrics registered to MBean
> ---
>
> Key: KAFKA-9504
> URL: https://issues.apache.org/jira/browse/KAFKA-9504
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.4.0
>Reporter: Andreas Holmén
>Priority: Major
>
> After close() called on a KafkaConsumer some registered MBeans are not 
> unregistered causing leak.
>  
>  
> {code:java}
> import static 
> org.apache.kafka.clients.consumer.ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG;
> import java.lang.management.ManagementFactory;
> import java.util.HashMap;
> import java.util.Map;
> import javax.management.MBeanServer;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.common.serialization.ByteArrayDeserializer;
> public class Leaker {
>  private static String bootstrapServers = "hostname:9092";
>  
>  public static void main(String[] args) throws InterruptedException {
>   MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
>   Map<String, Object> props = new HashMap<>();
>   props.put(BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
>  
>   int beans = mBeanServer.getMBeanCount();
>   for (int i = 0; i < 100; i++) {
>KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props, new 
> ByteArrayDeserializer(), new ByteArrayDeserializer());
>consumer.close();
>   }
>   int newBeans = mBeanServer.getMBeanCount();
>   System.out.println("\nbeans delta: " + (newBeans - beans));
>  }
> }
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8904) Reduce metadata lookups when producing to a large number of topics

2020-02-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033653#comment-17033653
 ] 

ASF GitHub Bot commented on KAFKA-8904:
---

rajinisivaram commented on pull request #7781: KAFKA-8904: Improve producer's 
topic metadata fetching.
URL: https://github.com/apache/kafka/pull/7781
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Reduce metadata lookups when producing to a large number of topics
> --
>
> Key: KAFKA-8904
> URL: https://issues.apache.org/jira/browse/KAFKA-8904
> Project: Kafka
>  Issue Type: Improvement
>  Components: controller, producer 
>Reporter: Brian Byrne
>Priority: Minor
>
> Per [~lbradstreet]:
>  
> "The problem was that the producer starts with no knowledge of topic 
> metadata. So they start the producer up, and then they start sending messages 
> to any of the thousands of topics that exist. Each time a message is sent to 
> a new topic, it'll trigger a metadata request if the producer doesn't know 
> about it. These metadata requests are done in serial such that if you send 
> 2000 messages to 2000 topics, it will trigger 2000 new metadata requests.
>  
> Each successive metadata request will include every topic seen so far, so the 
> first metadata request will include 1 topic, the second will include 2 
> topics, etc.
>  
> An additional problem is that this can take a while, and metadata expiry (for 
> metadata that has not been recently used) is hard coded to 5 mins, so if the 
> initial fetches take long enough you can end up evicting the metadata 
> before you send another message to a topic.
> So the approaches above are:
> 1. We can linger for a bit before making a metadata request, allow more sends 
> to go through, and then batch the metadata request for topics we we need in a 
> single metadata request.
> 2. We can allow pre-seeding the producer with metadata for a list of topics 
> you care about.
> I prefer 1 if we can make it work."
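
For reference, a minimal sketch of the send pattern described above (broker address and topic names are illustrative): each first send to a previously unseen topic blocks the sender on its own metadata request, which is what the two approaches try to avoid.

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ManyTopicsProducerSketch {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The first send to each new topic triggers a serial metadata request, and each
            // successive request carries every topic seen so far.
            for (int i = 0; i < 2000; i++) {
                producer.send(new ProducerRecord<>("topic-" + i, "key", "value"));
            }
        }
    }
}
{code}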



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9531) java.net.UnknownHostException loop on VM rolling update using CNAME

2020-02-10 Thread Rui Abreu (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033650#comment-17033650
 ] 

Rui Abreu commented on KAFKA-9531:
--

Seems related to the following issues:

 

https://issues.apache.org/jira/browse/KAFKA-7890

https://issues.apache.org/jira/browse/KAFKA-8057

> java.net.UnknownHostException loop on VM rolling update using CNAME
> ---
>
> Key: KAFKA-9531
> URL: https://issues.apache.org/jira/browse/KAFKA-9531
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, controller, producer 
>Affects Versions: 2.4.0
>Reporter: Rui Abreu
>Priority: Major
>
> Hello,
>  
> My cluster setup is based on VMs behind a DNS CNAME.
> Example:  node.internal is a CNAME to either nodeA.internal or nodeB.internal
> Since kafka-client 1.2.1,  it has been observed that sometimes Kafka clients 
> get stuck in a loop with the exception:
> Example after nodeB.internal is replaced with nodeA.internal 
>  
> {code:java}
> 2020-02-10T12:11:28.181Z o.a.k.c.NetworkClient [WARN]- [Consumer 
> clientId=consumer-6, groupId=consumer.group] Error connecting to node 
> nodeB.internal:9092 (id: 2 rack: null)
> java.net.UnknownHostException: nodeB.internal:9092
>   at java.net.InetAddress.getAllByName0(InetAddress.java:1281) 
> ~[?:1.8.0_222]
>   at java.net.InetAddress.getAllByName(InetAddress.java:1193) 
> ~[?:1.8.0_222]
>   at java.net.InetAddress.getAllByName(InetAddress.java:1127) 
> ~[?:1.8.0_222]
>   at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:104) 
> ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.currentAddress(ClusterConnectionStates.java:403)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:943)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.NetworkClient.access$600(NetworkClient.java:68) 
> ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1114)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1005)
>  ~[stormjar.jar:?]
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:537) 
> ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:161)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:366)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1251)
>  ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1220) 
> ~[stormjar.jar:?]
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1159) 
> ~[stormjar.jar:?]
>   at 
> org.apache.storm.kafka.spout.KafkaSpout.pollKafkaBroker(KafkaSpout.java:365) 
> ~[stormjar.jar:?]
>   at 
> org.apache.storm.kafka.spout.KafkaSpout.nextTuple(KafkaSpout.java:294) 
> ~[stormjar.jar:?]
>   at 
> org.apache.storm.daemon.executor$fn__10715$fn__10730$fn__10761.invoke(executor.clj:649)
>  ~[storm-core-1.1.3.jar:1.1.3]
>   at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:484) 
> ~[storm-core-1.1.3.jar:1.1.3]
>   at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.7.0.jar:?]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
> {code}
>  
> The time it spends in the loop is arbitrary, but it seems the client 
> effectively stops while this is happening.
> This error contrasts with instances where the client is able to recover on 
> its own after a few seconds:
> {code:java}
> 2020-02-08T01:15:37.390Z o.a.k.c.c.i.AbstractCoordinator [INFO] - [Consumer 
> clientId=consumer-7, groupId=consumer-group] Group coordinator 
> nodeA.internal:9092 (id: 2147483645 rack: null) is unavailable or invalid, 
> will attempt rediscovery
>  
> 2020-02-08T01:15:37.885Z 

[jira] [Created] (KAFKA-9531) java.net.UnknownHostException loop on VM rolling update using CNAME

2020-02-10 Thread Rui Abreu (Jira)
Rui Abreu created KAFKA-9531:


 Summary: java.net.UnknownHostException loop on VM rolling update 
using CNAME
 Key: KAFKA-9531
 URL: https://issues.apache.org/jira/browse/KAFKA-9531
 Project: Kafka
  Issue Type: Bug
  Components: clients, controller, producer 
Affects Versions: 2.4.0
Reporter: Rui Abreu


Hello,

 

My cluster setup is based on VMs behind a DNS CNAME.

Example: node.internal is a CNAME to either nodeA.internal or nodeB.internal.

Since kafka-client 1.2.1, it has been observed that Kafka clients sometimes 
get stuck in a loop with the exception below.
Example, after nodeB.internal is replaced with nodeA.internal:

 
{code:java}
2020-02-10T12:11:28.181Z o.a.k.c.NetworkClient [WARN]- [Consumer 
clientId=consumer-6, groupId=consumer.group] Error connecting to node 
nodeB.internal:9092 (id: 2 rack: null)
java.net.UnknownHostException: nodeB.internal:9092
at java.net.InetAddress.getAllByName0(InetAddress.java:1281) 
~[?:1.8.0_222]
at java.net.InetAddress.getAllByName(InetAddress.java:1193) 
~[?:1.8.0_222]
at java.net.InetAddress.getAllByName(InetAddress.java:1127) 
~[?:1.8.0_222]
at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:104) 
~[stormjar.jar:?]
at 
org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.currentAddress(ClusterConnectionStates.java:403)
 ~[stormjar.jar:?]
at 
org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363)
 ~[stormjar.jar:?]
at 
org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
 ~[stormjar.jar:?]
at 
org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:943) 
~[stormjar.jar:?]
at 
org.apache.kafka.clients.NetworkClient.access$600(NetworkClient.java:68) 
~[stormjar.jar:?]
at 
org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1114)
 ~[stormjar.jar:?]
at 
org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1005)
 ~[stormjar.jar:?]
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:537) 
~[stormjar.jar:?]
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
 ~[stormjar.jar:?]
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
 ~[stormjar.jar:?]
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
 ~[stormjar.jar:?]
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:161)
 ~[stormjar.jar:?]
at 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:366)
 ~[stormjar.jar:?]
at 
org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1251)
 ~[stormjar.jar:?]
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1220) 
~[stormjar.jar:?]
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1159) 
~[stormjar.jar:?]
at 
org.apache.storm.kafka.spout.KafkaSpout.pollKafkaBroker(KafkaSpout.java:365) 
~[stormjar.jar:?]
at 
org.apache.storm.kafka.spout.KafkaSpout.nextTuple(KafkaSpout.java:294) 
~[stormjar.jar:?]
at 
org.apache.storm.daemon.executor$fn__10715$fn__10730$fn__10761.invoke(executor.clj:649)
 ~[storm-core-1.1.3.jar:1.1.3]
at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:484) 
~[storm-core-1.1.3.jar:1.1.3]
at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
{code}
 

The time the client spends in the loop is arbitrary, but it appears to stop 
making progress entirely while this is happening.

This error contrasts with instances where the client is able to recover on its 
own after a few seconds:


{code:java}
2020-02-08T01:15:37.390Z o.a.k.c.c.i.AbstractCoordinator [INFO] - [Consumer 
clientId=consumer-7, groupId=consumer-group] Group coordinator 
nodeA.internal:9092 (id: 2147483645 rack: null) is unavailable or invalid, will 
attempt rediscovery
 
2020-02-08T01:15:37.885Z o.a.k.c.c.i.AbstractCoordinator [INFO] - [Consumer 
clientId=consumer-7, groupId=consumer-group] Discovered group coordinator 
nodeB.internal:9092 (id: 2147483646 rack: null)

2020-02-08T01:15:37.885Z o.a.k.c.ClusterConnectionStates [INFO] - [Consumer 
clientId=consumer-7, groupId=consumer-group] Hostname for node 2147483646 
changed from nodeA.internal to nodeB.internal
{code}
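
Not part of the original report, but one client-side setting that may be worth 
experimenting with in a CNAME-based setup is client.dns.lookup (available in the 
Java clients since 2.1.0). A minimal consumer sketch is below; the group and topic 
names are placeholders, and this is an avenue to test rather than a confirmed fix 
for this ticket.
{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DnsLookupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "node.internal:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "consumer.group");                 // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // "use_all_dns_ips" makes the client try every address a hostname resolves to
        // instead of pinning a single cached address. Whether this helps with the
        // rolling-update scenario described above is an assumption to verify.
        props.put(ConsumerConfig.CLIENT_DNS_LOOKUP_CONFIG, "use_all_dns_ips");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("some-topic"));             // placeholder
            consumer.poll(Duration.ofSeconds(1));
        }
    }
}
{code}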


   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9532) Deleting the consumer group programmatically using RESTful API

2020-02-10 Thread Rakshith Mamidala (Jira)
Rakshith  Mamidala created KAFKA-9532:
-

 Summary: Deleting the consumer group programmatically using RESTful 
API
 Key: KAFKA-9532
 URL: https://issues.apache.org/jira/browse/KAFKA-9532
 Project: Kafka
  Issue Type: Wish
  Components: clients
Affects Versions: 2.4.0
Reporter: Rakshith  Mamidala
 Fix For: 2.4.0


As a requirement in our project, instead of listening for messages and 
consuming / storing message data into a database, we create consumer groups at 
runtime per user (to avoid thread-safety issues), use consumer.poll and 
consumer.seekToBeginning, and once all the messages are read we close the 
connection and unsubscribe the consumer group. 

 

What happens in Kafka is that the consumer groups move from the active state to 
the DEAD state but never get removed / deleted; in Kafka Tools, all the consumer 
groups are still shown even when they are DEAD.

 

*What we want:*
 # How to remove / delete the consumer groups programmatically (see the sketch 
below).
 # Is there a REST endpoint / command-line tool / script to delete consumer 
groups? If so, which ones?
 # What impact can DEAD consumer groups have in a production environment?
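
Not part of the original request, but for point 1 a minimal sketch using the Java 
AdminClient's deleteConsumerGroups API is below; the bootstrap address and group 
name are placeholders.
{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class DeleteConsumerGroupSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Deletion only succeeds once the group has no active members,
            // i.e. after the per-user consumer has been closed/unsubscribed.
            admin.deleteConsumerGroups(Collections.singletonList("per-user-group-42")) // placeholder
                 .all()
                 .get();
        }
    }
}
{code}
For point 2, Apache Kafka itself does not ship a REST endpoint for this, but the 
bundled command-line tool supports it, e.g. 
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --delete --group per-user-group-42.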

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-4090) JVM runs into OOM if (Java) client uses a SSL port without setting the security protocol

2020-02-10 Thread Alexandre Dupriez (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033588#comment-17033588
 ] 

Alexandre Dupriez commented on KAFKA-4090:
--

Thanks [~belugabehr] for sharing your results. Do you have a pre-PR with your 
changes so that I can have a look?

> JVM runs into OOM if (Java) client uses a SSL port without setting the 
> security protocol
> 
>
> Key: KAFKA-4090
> URL: https://issues.apache.org/jira/browse/KAFKA-4090
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 0.9.0.1, 0.10.0.1, 2.1.0
>Reporter: Jaikiran Pai
>Assignee: Alexandre Dupriez
>Priority: Major
>
> Quoting from the mail thread that was sent to Kafka mailing list:
> {quote}
> We have been using Kafka 0.9.0.1 (server and Java client libraries). So far 
> we had been using it with plaintext transport but recently have been 
> considering upgrading to using SSL. It mostly works except that a 
> mis-configured producer (and even consumer) causes a hard to relate 
> OutOfMemory exception and thus causing the JVM in which the client is 
> running, to go into a bad state. We can consistently reproduce that OOM very 
> easily. We decided to check if this is something that is fixed in 0.10.0.1 so 
> upgraded one of our test systems to that version (both server and client 
> libraries) but still see the same issue. Here's how it can be easily 
> reproduced
> 1. Enable SSL listener on the broker via server.properties, as per the Kafka 
> documentation
> {code}
> listeners=PLAINTEXT://:9092,SSL://:9093
> ssl.keystore.location=
> ssl.keystore.password=pass
> ssl.key.password=pass
> ssl.truststore.location=
> ssl.truststore.password=pass
> {code}
> 2. Start zookeeper and kafka server
> 3. Create a "oom-test" topic (which will be used for these tests):
> {code}
> kafka-topics.sh --zookeeper localhost:2181 --create --topic oom-test  
> --partitions 1 --replication-factor 1
> {code}
> 4. Create a simple producer which sends a single message to the topic via 
> Java (new producer) APIs:
> {code}
> public class OOMTest {
> public static void main(final String[] args) throws Exception {
> final Properties kafkaProducerConfigs = new Properties();
> // NOTE: Intentionally use a SSL port without specifying 
> security.protocol as SSL
> 
> kafkaProducerConfigs.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, 
> "localhost:9093");
> 
> kafkaProducerConfigs.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, 
> StringSerializer.class.getName());
> 
> kafkaProducerConfigs.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
>  StringSerializer.class.getName());
> try (KafkaProducer<String, String> producer = new 
> KafkaProducer<>(kafkaProducerConfigs)) {
> System.out.println("Created Kafka producer");
> final String topicName = "oom-test";
> final String message = "Hello OOM!";
> // send a message to the topic
> final Future<RecordMetadata> recordMetadataFuture = 
> producer.send(new ProducerRecord<>(topicName, message));
> final RecordMetadata sentRecordMetadata = 
> recordMetadataFuture.get();
> System.out.println("Sent message '" + message + "' to topic '" + 
> topicName + "'");
> }
> System.out.println("Tests complete");
> }
> }
> {code}
> Notice that the server URL is using a SSL endpoint localhost:9093 but isn't 
> specifying any of the other necessary SSL configs like security.protocol.
> 5. For the sake of easily reproducing this issue run this class with a max 
> heap size of 256MB (-Xmx256M). Running this code throws up the following 
> OutOfMemoryError in one of the Sender threads:
> {code}
> 18:33:25,770 ERROR [KafkaThread] - Uncaught exception in 
> kafka-producer-network-thread | producer-1:
> java.lang.OutOfMemoryError: Java heap space
> at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
> at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
> at 
> org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:93)
> at 
> org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
> at 
> org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:153)
> at 
> org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:134)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:286)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:256)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:216)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:128)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Note that I set 
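
For reference (editorial addition, not part of the quoted report): a client that is 
actually meant to use the SSL listener would set the security protocol and 
truststore explicitly. A minimal sketch, with placeholder paths and passwords:
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SslProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties cfg = new Properties();
        cfg.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9093");
        cfg.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        cfg.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // The settings the misconfigured client in the report leaves out:
        cfg.setProperty("security.protocol", "SSL");
        cfg.setProperty("ssl.truststore.location", "/path/to/client.truststore.jks"); // placeholder
        cfg.setProperty("ssl.truststore.password", "pass");                           // placeholder

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(cfg)) {
            producer.send(new ProducerRecord<>("oom-test", "Hello OOM!")).get();
        }
    }
}
{code}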

[jira] [Commented] (KAFKA-9423) Refine layout of configuration options on website and make individual settings directly linkable

2020-02-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033569#comment-17033569
 ] 

ASF GitHub Bot commented on KAFKA-9423:
---

mimaison commented on pull request #251: KAFKA-9423: Refine layout of 
configuration options on website and make individual settings directly linkable
URL: https://github.com/apache/kafka-site/pull/251
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refine layout of configuration options on website and make individual 
> settings directly linkable
> 
>
> Key: KAFKA-9423
> URL: https://issues.apache.org/jira/browse/KAFKA-9423
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sönke Liebau
>Assignee: Sönke Liebau
>Priority: Trivial
> Attachments: option1.png, option2.png, option3.png, option4.png
>
>
> KAFKA-8474 changed the layout of configuration options on the website from a 
> table which over time ran out of horizontal space to a list.
> This vastly improved readability but is not yet ideal. Further discussion took 
> place in the [PR|https://github.com/apache/kafka/pull/6870] for KAFKA-8474.
> This ticket moves that discussion to a separate thread, makes it more visible 
> to other people, and gives subsequent PRs a home.
> Currently proposed options are attached to this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7052) ExtractField SMT throws NPE - needs clearer error message

2020-02-10 Thread Robin Moffatt (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033473#comment-17033473
 ] 

Robin Moffatt commented on KAFKA-7052:
--

I agree that even just improving the error message would be a good first step 
here. 
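
As a rough illustration only (hypothetical code, not the actual ExtractField 
implementation), the kind of guard that could turn the NPE into a descriptive 
error might look like the sketch below; the extractField helper and its message 
are made up for this example.
{code:java}
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.errors.DataException;

// Hypothetical helper for a Struct-based record inside an SMT's apply() method.
final class FieldExtraction {
    static Object extractField(Struct value, String fieldName) {
        if (value.schema().field(fieldName) == null) {
            // Fail with a message naming the missing field instead of a bare NPE.
            throw new DataException("Unknown field: '" + fieldName
                    + "'. Available fields: " + value.schema().fields());
        }
        return value.get(fieldName);
    }
}
{code}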

> ExtractField SMT throws NPE - needs clearer error message
> -
>
> Key: KAFKA-7052
> URL: https://issues.apache.org/jira/browse/KAFKA-7052
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Reporter: Robin Moffatt
>Priority: Major
>
> With the following Single Message Transform: 
> {code:java}
> "transforms.ExtractId.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
> "transforms.ExtractId.field":"id"{code}
> Kafka Connect errors with : 
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.kafka.connect.transforms.ExtractField.apply(ExtractField.java:61)
> at 
> org.apache.kafka.connect.runtime.TransformationChain.apply(TransformationChain.java:38){code}
> There should be a better error message here, identifying the reason for the 
> NPE.
> Version: Confluent Platform 4.1.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)