[jira] [Commented] (KAFKA-7768) Java doc link 404
[ https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736529#comment-16736529 ] ASF GitHub Bot commented on KAFKA-7768: --- guozhangwang commented on pull request #178: KAFKA-7768: Use absolute paths for javadoc URL: https://github.com/apache/kafka-site/pull/178 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Java doc link 404 > -- > > Key: KAFKA-7768 > URL: https://issues.apache.org/jira/browse/KAFKA-7768 > Project: Kafka > Issue Type: Bug > Components: documentation >Affects Versions: 2.1.0 >Reporter: Slim Ouertani >Priority: Critical > Fix For: 2.2.0, 2.1.1 > > > The official documentation link example > [https://kafka.apache.org/documentation/streams/developer-guide/interactive-queries.html] > (with no release reference) does not refer to a valid Java doc such as > [https://kafka.apache.org/javadoc/org/apache/kafka/streams/state/StreamsMetadata.html]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7768) Java doc link 404
[ https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736423#comment-16736423 ] ASF GitHub Bot commented on KAFKA-7768: --- vvcephei commented on pull request #178: KAFKA-7768: Use absolute paths for javadoc URL: https://github.com/apache/kafka-site/pull/178 The breakage observed in KAFKA-7768 occurred only when following a link from a page with a non-versioned URL. When starting from a versioned URL, the number of ../ up-paths was correct, so inserting the version actually breaks it. To avoid this situation, this PR switches to using absolute paths in conjunction with the version number, as other links in the docs do. See also #176, https://github.com/apache/kafka/pull/6094, and https://github.com/apache/kafka/pull/6100 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Java doc link 404 > -- > > Key: KAFKA-7768 > URL: https://issues.apache.org/jira/browse/KAFKA-7768 > Project: Kafka > Issue Type: Bug > Components: documentation >Affects Versions: 2.1.0 >Reporter: Slim Ouertani >Priority: Critical > Fix For: 2.2.0, 2.1.1 > > > The official documentation link example > [https://kafka.apache.org/documentation/streams/developer-guide/interactive-queries.html] > (with no release reference) does not refer to a valid Java doc such as > [https://kafka.apache.org/javadoc/org/apache/kafka/streams/state/StreamsMetadata.html]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
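A minimal illustration of the up-path problem described above, using java.net.URI.resolve (the three-level relative link mirrors the docs layout here; the exact link in the docs may differ):

{code:java}
import java.net.URI;

public class JavadocLinkDemo {
    public static void main(String[] args) {
        // Base pages: one versioned, one not.
        URI versioned = URI.create("https://kafka.apache.org/21/documentation/streams/developer-guide/interactive-queries.html");
        URI unversioned = URI.create("https://kafka.apache.org/documentation/streams/developer-guide/interactive-queries.html");
        // A relative javadoc link with three up-paths, as a versioned page would need.
        String relative = "../../../javadoc/org/apache/kafka/streams/state/StreamsMetadata.html";
        System.out.println(versioned.resolve(relative));   // .../21/javadoc/... - exists
        System.out.println(unversioned.resolve(relative)); // .../javadoc/...    - 404
    }
}
{code}

An absolute path that already carries the version resolves to the same target from both pages, which is why the PR switches to absolute paths.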
[jira] [Commented] (KAFKA-7786) Fast update of leader epoch may stall partition fetching due to FENCED_LEADER_EPOCH
[ https://issues.apache.org/jira/browse/KAFKA-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736420#comment-16736420 ] Anna Povzner commented on KAFKA-7786: - Hi [~junrao], thanks for the comment. I recently talked to [~hachikuji] and he suggested the same fix, with no protocol change. I just opened a PR: [https://github.com/apache/kafka/pull/6101]. > Fast update of leader epoch may stall partition fetching due to > FENCED_LEADER_EPOCH > --- > > Key: KAFKA-7786 > URL: https://issues.apache.org/jira/browse/KAFKA-7786 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Anna Povzner >Assignee: Anna Povzner >Priority: Critical > > KIP-320/KAFKA-7395 added a FENCED_LEADER_EPOCH error response to an > OffsetsForLeaderEpoch request if the epoch in the request is lower than the > broker's leader epoch. ReplicaFetcherThread builds an OffsetsForLeaderEpoch > request under _partitionMapLock_, sends the request outside the lock, and > then processes the response under _partitionMapLock_. The broker may receive > LeaderAndIsr with the same leader but with the next leader epoch, and remove and > add the partition to the fetcher thread (with partition state reflecting the > updated leader epoch) – all while the OffsetsForLeaderEpoch request (with the > old leader epoch) is still outstanding/waiting for the lock to process the > OffsetsForLeaderEpoch response. As a result, the partition gets removed from > partitionStates and this broker will not fetch for this partition until the > next LeaderAndIsr, which may take a while. We will see a log message like this: > [2018-12-23 07:23:04,802] INFO [ReplicaFetcher replicaId=3, leaderId=2, > fetcherId=0] Partition test_topic-17 has an older epoch (7) than the current > leader. Will await the new LeaderAndIsr state before resuming fetching. > (kafka.server.ReplicaFetcherThread) > We saw this happen with > kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.reassign_from_offset_zero=True. > This test does partition re-assignment while bouncing 2 out of 4 total > brokers. When the failure happened, each bounced broker was also a controller. > Because of re-assignment, the controller updates the leader epoch without > updating the leader on controller change or on broker startup, so we see > several leader epoch changes without a leader change, which increases the > likelihood of the race condition described above. > Here are the exact events that happened in this test (around the failure): > We have 4 brokers: 1, 2, 3, 4. Partition re-assignment is started for > test_topic-17 [2, 4, 1] —> [3, 1, 2]. At time t0, the leader of test_topic-17 is > broker 2. > # clean shutdown of broker 3, which is also a controller > # broker 4 becomes controller, continues re-assignment and updates the leader > epoch for test_topic-17 to 6 (with the same leader) > # broker 2 (leader of test_topic-17) receives the new leader epoch: > “test_topic-17 starts at Leader Epoch 6 from offset 1388. Previous Leader > Epoch was: 5” > # broker 3 is started again after clean shutdown > # controller sees broker 3 startup, and sends LeaderAndIsr(leader epoch 6) > to broker 3 > # controller updates leader epoch to 7 > # broker 2 (leader of test_topic-17) receives LeaderAndIsr for leader epoch > 7: “test_topic-17 starts at Leader Epoch 7 from offset 1974. Previous Leader > Epoch was: 6” > # broker 3 receives LeaderAndIsr for test_topic-17 and leader epoch 6 from > controller: “Added fetcher to broker BrokerEndPoint(id=2) for leader epoch 6” > and sends an OffsetsForLeaderEpoch request to broker 2 > # broker 3 receives LeaderAndIsr for test_topic-17 and leader epoch 7 from > controller; removes the fetcher thread and adds the fetcher thread + executes > AbstractFetcherThread.addPartitions(), which updates the partition state with > leader epoch 7 > # broker 3 receives FENCED_LEADER_EPOCH in response to > OffsetsForLeaderEpoch(leader epoch 6), because the leader received > LeaderAndIsr for leader epoch 7 before it got OffsetsForLeaderEpoch(leader > epoch 6) from broker 3. As a result, it removes the partition from > partitionStates and does not fetch until the controller updates the leader epoch > and sends LeaderAndIsr for this partition to broker 3. The test fails > because re-assignment does not finish on time (due to broker 3 not fetching). > > One way to address this is possibly to add more state to PartitionFetchState. > However, that may introduce other race conditions. A cleaner way, I think, is to > return the leader epoch in the OffsetsForLeaderEpoch response with the > FENCED_LEADER_EPOCH error, and then ignore the error if the partition state > contains a higher leader epoch. The adv
[jira] [Commented] (KAFKA-7786) Fast update of leader epoch may stall partition fetching due to FENCED_LEADER_EPOCH
[ https://issues.apache.org/jira/browse/KAFKA-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736415#comment-16736415 ] Jun Rao commented on KAFKA-7786: Great find, Anna. About the fix: I am not sure that we need the protocol change. We know the leader epoch used in the OffsetsForLeaderEpoch request. We can just pass that along with the OffsetsForLeaderEpoch response to maybeTruncate(). If the leaderEpoch in partitionStates has changed, we simply ignore the response and retry the OffsetsForLeaderEpoch request with the new leader epoch. > Fast update of leader epoch may stall partition fetching due to > FENCED_LEADER_EPOCH > --- > > Key: KAFKA-7786 > URL: https://issues.apache.org/jira/browse/KAFKA-7786 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Anna Povzner >Assignee: Anna Povzner >Priority: Critical > > KIP-320/KAFKA-7395 added a FENCED_LEADER_EPOCH error response to an > OffsetsForLeaderEpoch request if the epoch in the request is lower than the > broker's leader epoch. ReplicaFetcherThread builds an OffsetsForLeaderEpoch > request under _partitionMapLock_, sends the request outside the lock, and > then processes the response under _partitionMapLock_. The broker may receive > LeaderAndIsr with the same leader but with the next leader epoch, and remove and > add the partition to the fetcher thread (with partition state reflecting the > updated leader epoch) – all while the OffsetsForLeaderEpoch request (with the > old leader epoch) is still outstanding/waiting for the lock to process the > OffsetsForLeaderEpoch response. As a result, the partition gets removed from > partitionStates and this broker will not fetch for this partition until the > next LeaderAndIsr, which may take a while. We will see a log message like this: > [2018-12-23 07:23:04,802] INFO [ReplicaFetcher replicaId=3, leaderId=2, > fetcherId=0] Partition test_topic-17 has an older epoch (7) than the current > leader. Will await the new LeaderAndIsr state before resuming fetching. > (kafka.server.ReplicaFetcherThread) > We saw this happen with > kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.reassign_from_offset_zero=True. > This test does partition re-assignment while bouncing 2 out of 4 total > brokers. When the failure happened, each bounced broker was also a controller. > Because of re-assignment, the controller updates the leader epoch without > updating the leader on controller change or on broker startup, so we see > several leader epoch changes without a leader change, which increases the > likelihood of the race condition described above. > Here are the exact events that happened in this test (around the failure): > We have 4 brokers: 1, 2, 3, 4. Partition re-assignment is started for > test_topic-17 [2, 4, 1] —> [3, 1, 2]. At time t0, the leader of test_topic-17 is > broker 2. > # clean shutdown of broker 3, which is also a controller > # broker 4 becomes controller, continues re-assignment and updates the leader > epoch for test_topic-17 to 6 (with the same leader) > # broker 2 (leader of test_topic-17) receives the new leader epoch: > “test_topic-17 starts at Leader Epoch 6 from offset 1388. Previous Leader > Epoch was: 5” > # broker 3 is started again after clean shutdown > # controller sees broker 3 startup, and sends LeaderAndIsr(leader epoch 6) > to broker 3 > # controller updates leader epoch to 7 > # broker 2 (leader of test_topic-17) receives LeaderAndIsr for leader epoch > 7: “test_topic-17 starts at Leader Epoch 7 from offset 1974. Previous Leader > Epoch was: 6” > # broker 3 receives LeaderAndIsr for test_topic-17 and leader epoch 6 from > controller: “Added fetcher to broker BrokerEndPoint(id=2) for leader epoch 6” > and sends an OffsetsForLeaderEpoch request to broker 2 > # broker 3 receives LeaderAndIsr for test_topic-17 and leader epoch 7 from > controller; removes the fetcher thread and adds the fetcher thread + executes > AbstractFetcherThread.addPartitions(), which updates the partition state with > leader epoch 7 > # broker 3 receives FENCED_LEADER_EPOCH in response to > OffsetsForLeaderEpoch(leader epoch 6), because the leader received > LeaderAndIsr for leader epoch 7 before it got OffsetsForLeaderEpoch(leader > epoch 6) from broker 3. As a result, it removes the partition from > partitionStates and does not fetch until the controller updates the leader epoch > and sends LeaderAndIsr for this partition to broker 3. The test fails > because re-assignment does not finish on time (due to broker 3 not fetching). > > One way to address this is possibly to add more state to PartitionFetchState. > However, that may introduce other race conditions. A cleaner way, I think, is
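A self-contained toy sketch of the check Jun describes (hypothetical names; the real code lives in the Scala ReplicaFetcherThread): remember the epoch each request was sent with, and drop the response if the partition's epoch advanced while the request was in flight.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class EpochFencedResponseCheck {
    // Stand-in for partitionStates: partition name -> current leader epoch.
    static final Map<String, Integer> partitionLeaderEpoch = new ConcurrentHashMap<>();

    // Process the OffsetsForLeaderEpoch response only if the epoch it was sent
    // with still matches the partition state; otherwise ignore it and retry
    // with the new epoch instead of removing the partition.
    static boolean shouldProcessResponse(String partition, int requestEpoch) {
        Integer current = partitionLeaderEpoch.get(partition);
        return current != null && current == requestEpoch;
    }

    public static void main(String[] args) {
        partitionLeaderEpoch.put("test_topic-17", 6); // request sent with epoch 6
        partitionLeaderEpoch.put("test_topic-17", 7); // LeaderAndIsr bumps the epoch in flight
        // FENCED_LEADER_EPOCH response for epoch 6 arrives: it is stale, so skip it.
        System.out.println(shouldProcessResponse("test_topic-17", 6)); // false
    }
}
{code}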
[jira] [Commented] (KAFKA-7786) Fast update of leader epoch may stall partition fetching due to FENCED_LEADER_EPOCH
[ https://issues.apache.org/jira/browse/KAFKA-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736416#comment-16736416 ] ASF GitHub Bot commented on KAFKA-7786: --- apovzner commented on pull request #6101: KAFKA-7786: Ignore OffsetsForLeaderEpoch response if leader epoch changed while request in flight URL: https://github.com/apache/kafka/pull/6101 There is a race condition in ReplicaFetcherThread, where we can update PartitionFetchState with the new leader epoch (same leader) before handling an OffsetsForLeaderEpoch response with a FENCED_LEADER_EPOCH error, which causes the partition to be removed from partitionStates and in turn stops fetching until the next LeaderAndIsr. Our system test kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.reassign_from_offset_zero=True failed 3 times due to this error in the last couple of months. Since this test already exercises this condition, no new tests are added. Also added a toString() implementation to PartitionData, because some log messages did not show useful info, which I found while investigating the above system test failure. cc @hachikuji who suggested the fix. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Fast update of leader epoch may stall partition fetching due to > FENCED_LEADER_EPOCH > --- > > Key: KAFKA-7786 > URL: https://issues.apache.org/jira/browse/KAFKA-7786 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Anna Povzner >Assignee: Anna Povzner >Priority: Critical > > KIP-320/KAFKA-7395 added a FENCED_LEADER_EPOCH error response to an > OffsetsForLeaderEpoch request if the epoch in the request is lower than the > broker's leader epoch. ReplicaFetcherThread builds an OffsetsForLeaderEpoch > request under _partitionMapLock_, sends the request outside the lock, and > then processes the response under _partitionMapLock_. The broker may receive > LeaderAndIsr with the same leader but with the next leader epoch, and remove and > add the partition to the fetcher thread (with partition state reflecting the > updated leader epoch) – all while the OffsetsForLeaderEpoch request (with the > old leader epoch) is still outstanding/waiting for the lock to process the > OffsetsForLeaderEpoch response. As a result, the partition gets removed from > partitionStates and this broker will not fetch for this partition until the > next LeaderAndIsr, which may take a while. We will see a log message like this: > [2018-12-23 07:23:04,802] INFO [ReplicaFetcher replicaId=3, leaderId=2, > fetcherId=0] Partition test_topic-17 has an older epoch (7) than the current > leader. Will await the new LeaderAndIsr state before resuming fetching. > (kafka.server.ReplicaFetcherThread) > We saw this happen with > kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.reassign_from_offset_zero=True. > This test does partition re-assignment while bouncing 2 out of 4 total > brokers. When the failure happened, each bounced broker was also a controller. > Because of re-assignment, the controller updates the leader epoch without > updating the leader on controller change or on broker startup, so we see > several leader epoch changes without a leader change, which increases the > likelihood of the race condition described above. > Here are the exact events that happened in this test (around the failure): > We have 4 brokers: 1, 2, 3, 4. Partition re-assignment is started for > test_topic-17 [2, 4, 1] —> [3, 1, 2]. At time t0, the leader of test_topic-17 is > broker 2. > # clean shutdown of broker 3, which is also a controller > # broker 4 becomes controller, continues re-assignment and updates the leader > epoch for test_topic-17 to 6 (with the same leader) > # broker 2 (leader of test_topic-17) receives the new leader epoch: > “test_topic-17 starts at Leader Epoch 6 from offset 1388. Previous Leader > Epoch was: 5” > # broker 3 is started again after clean shutdown > # controller sees broker 3 startup, and sends LeaderAndIsr(leader epoch 6) > to broker 3 > # controller updates leader epoch to 7 > # broker 2 (leader of test_topic-17) receives LeaderAndIsr for leader ep
[jira] [Commented] (KAFKA-6460) Add mocks for state stores used in Streams unit testing
[ https://issues.apache.org/jira/browse/KAFKA-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736406#comment-16736406 ] Guozhang Wang commented on KAFKA-6460: -- Great! Feel free to look at KIP-267 for the context of this: https://cwiki.apache.org/confluence/display/KAFKA/KIP-267%3A+Add+Processor+Unit+Test+Support+to+Kafka+Streams+Test+Utils . > Add mocks for state stores used in Streams unit testing > --- > > Key: KAFKA-6460 > URL: https://issues.apache.org/jira/browse/KAFKA-6460 > Project: Kafka > Issue Type: Improvement > Components: streams, unit tests >Reporter: Guozhang Wang >Priority: Major > Labels: newbie++ > > We'd like to use mocks for the different types of state stores (kv, window, > session) that can be used to record the number of expected put / get calls > in DSL operator unit testing. This involves implementing the two > interfaces {{StoreSupplier}} and {{StoreBuilder}} so that they can return an object > created from, say, EasyMock, and the object can then be set up with the > expected calls. > In addition, we should also add a mock record collector which can be returned > from the mock processor context so that, with a logging-enabled store, users can > also validate whether the changes have been forwarded to the changelog. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-6460) Add mocks for state stores used in Streams unit testing
[ https://issues.apache.org/jira/browse/KAFKA-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736402#comment-16736402 ] Yishun Guan commented on KAFKA-6460: Hi [~guozhang], I am still interested if nobody else is looking into this. It will take some time since I am not familiar with the test packages, but I can definitely do some research and take a stab at it. > Add mocks for state stores used in Streams unit testing > --- > > Key: KAFKA-6460 > URL: https://issues.apache.org/jira/browse/KAFKA-6460 > Project: Kafka > Issue Type: Improvement > Components: streams, unit tests >Reporter: Guozhang Wang >Priority: Major > Labels: newbie++ > > We'd like to use mocks for the different types of state stores (kv, window, > session) that can be used to record the number of expected put / get calls > in DSL operator unit testing. This involves implementing the two > interfaces {{StoreSupplier}} and {{StoreBuilder}} so that they can return an object > created from, say, EasyMock, and the object can then be set up with the > expected calls. > In addition, we should also add a mock record collector which can be returned > from the mock processor context so that, with a logging-enabled store, users can > also validate whether the changes have been forwarded to the changelog. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7793) Improve the Trogdor command-line
Colin P. McCabe created KAFKA-7793: -- Summary: Improve the Trogdor command-line Key: KAFKA-7793 URL: https://issues.apache.org/jira/browse/KAFKA-7793 Project: Kafka Issue Type: Improvement Reporter: Colin P. McCabe Improve the Trogdor command-line. It should be easier to launch tasks from a task spec in a file. It should be easier to list the currently-running tasks in a readable way. We should be able to filter the currently-running tasks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-6460) Add mocks for state stores used in Streams unit testing
[ https://issues.apache.org/jira/browse/KAFKA-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736399#comment-16736399 ] Guozhang Wang commented on KAFKA-6460: -- It should be part of the public test-utils package, and this ticket also involves 1) adding a {{StoreTestDriver}} which can provide mock {{StoreSupplier}}s and {{StoreBuilder}}s, which in turn can provide different mocks of {{StateStore}}s; 2) having these state stores use the {{MockProcessorContext}} that already exists in test-utils; and 3) refactoring the internal unit tests of Streams to leverage the newly added state store / supplier / record collector and remove the vanilla ones (e.g. {{KeyValueStoreTestDriver}}). [~shung] Are you interested in picking this up? > Add mocks for state stores used in Streams unit testing > --- > > Key: KAFKA-6460 > URL: https://issues.apache.org/jira/browse/KAFKA-6460 > Project: Kafka > Issue Type: Improvement > Components: streams, unit tests >Reporter: Guozhang Wang >Priority: Major > Labels: newbie++ > > We'd like to use mocks for the different types of state stores (kv, window, > session) that can be used to record the number of expected put / get calls > in DSL operator unit testing. This involves implementing the two > interfaces {{StoreSupplier}} and {{StoreBuilder}} so that they can return an object > created from, say, EasyMock, and the object can then be set up with the > expected calls. > In addition, we should also add a mock record collector which can be returned > from the mock processor context so that, with a logging-enabled store, users can > also validate whether the changes have been forwarded to the changelog. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
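For illustration, a minimal sketch of the kind of mock the ticket asks for, assuming EasyMock (as the description suggests) and scripting the expected put/get calls on a KeyValueStore by hand:

{code:java}
import static org.easymock.EasyMock.*;

import org.apache.kafka.streams.state.KeyValueStore;

public class MockStoreSketch {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        // Script the calls the operator under test is expected to make.
        KeyValueStore<String, Long> store = mock(KeyValueStore.class);
        expect(store.get("key")).andReturn(null);
        store.put("key", 1L);
        expectLastCall().once();
        replay(store);

        // Stand-in for the DSL operator exercising the store.
        if (store.get("key") == null) {
            store.put("key", 1L);
        }

        verify(store); // fails if the expected put/get calls did not happen
    }
}
{code}

A StoreSupplier/StoreBuilder mock would simply return such an object so the topology under test picks it up in place of a real store.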
[jira] [Resolved] (KAFKA-4218) Enable access to key in ValueTransformer
[ https://issues.apache.org/jira/browse/KAFKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guozhang Wang resolved KAFKA-4218. -- Resolution: Fixed Fix Version/s: (was: 1.1.0) 2.1.0 This has been resolved by https://github.com/apache/kafka/commit/bcc712b45820da74b44209857ebbf7b9d59e0ed7 from [~jeyhunkarimov] > Enable access to key in ValueTransformer > > > Key: KAFKA-4218 > URL: https://issues.apache.org/jira/browse/KAFKA-4218 > Project: Kafka > Issue Type: Improvement > Components: streams >Affects Versions: 0.10.0.1 >Reporter: Elias Levy >Assignee: Jeyhun Karimov >Priority: Major > Labels: api, kip > Fix For: 2.1.0 > > > While transforming values via {{KStream.transformValues}} and > {{ValueTransformer}}, the key associated with the value may be needed, even > if it is not changed. For instance, it may be used to access stores. > As of now, the key is not available within these methods and interfaces, > leading to the use of {{KStream.transform}} and {{Transformer}}, and the > unnecessary creation of new {{KeyValue}} objects. > KIP-149: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-149%3A+Enabling+key+access+in+ValueTransformer%2C+ValueMapper%2C+and+ValueJoiner -- This message was sent by Atlassian JIRA (v7.6.3#76005)
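For reference, the shape of the API this resolution enables (KIP-149's ValueTransformerWithKey): the key is readable inside transform() without falling back to KStream.transform and new KeyValue objects. The class below is an illustrative example, not code from the PR.

{code:java}
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;

// Example transformer: the key is available read-only, e.g. for store lookups.
public class EnrichWithKey implements ValueTransformerWithKey<String, Long, String> {
    @Override
    public void init(ProcessorContext context) { }

    @Override
    public String transform(String readOnlyKey, Long value) {
        return readOnlyKey + ":" + value; // key accessible, no new KeyValue needed
    }

    @Override
    public void close() { }
}
{code}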
[jira] [Commented] (KAFKA-7768) Java doc link 404
[ https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736382#comment-16736382 ] ASF GitHub Bot commented on KAFKA-7768: --- guozhangwang commented on pull request #6100: KAFKA-7768: Use absolute paths for javadoc URL: https://github.com/apache/kafka/pull/6100 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Java doc link 404 > -- > > Key: KAFKA-7768 > URL: https://issues.apache.org/jira/browse/KAFKA-7768 > Project: Kafka > Issue Type: Bug > Components: documentation >Affects Versions: 2.1.0 >Reporter: Slim Ouertani >Priority: Critical > Fix For: 2.2.0, 2.1.1 > > > The official documentation link example > [https://kafka.apache.org/documentation/streams/developer-guide/interactive-queries.html] > (with no release reference) not referring to valid Java doc like > [https://kafka.apache.org/javadoc/org/apache/kafka/streams/state/StreamsMetadata.html]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (KAFKA-3729) Auto-configure non-default SerDes passed alongside the topology builder
[ https://issues.apache.org/jira/browse/KAFKA-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guozhang Wang reassigned KAFKA-3729: Assignee: (was: Bharat Viswanadham) > Auto-configure non-default SerDes passed alongside the topology builder > > > Key: KAFKA-3729 > URL: https://issues.apache.org/jira/browse/KAFKA-3729 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Fred Patton >Priority: Major > Labels: api, newbie > > From Guozhang Wang: > "Only default serdes provided through configs are auto-configured today. But > we could auto-configure other serdes passed alongside the topology builder as > well." -- This message was sent by Atlassian JIRA (v7.6.3#76005)
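To make the gap concrete: a serde handed directly to the topology builder currently never receives the call below, which auto-configuration would make on it (a minimal sketch using a built-in serde; custom serdes are where it actually matters).

{code:java}
import java.util.Collections;
import java.util.Map;

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;

public class SerdeConfigureDemo {
    public static void main(String[] args) {
        // Today only the default serdes from the configs are auto-configured;
        // a serde passed alongside the topology builder must be configured by hand.
        Map<String, Object> configs = Collections.emptyMap();
        Serde<String> serde = Serdes.String();
        serde.configure(configs, false); // isKey = false
    }
}
{code}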
[jira] [Updated] (KAFKA-4468) Correctly calculate the window end timestamp after read from state stores
[ https://issues.apache.org/jira/browse/KAFKA-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guozhang Wang updated KAFKA-4468: - Fix Version/s: 2.2.0 > Correctly calculate the window end timestamp after read from state stores > - > > Key: KAFKA-4468 > URL: https://issues.apache.org/jira/browse/KAFKA-4468 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Guozhang Wang >Assignee: Richard Yu >Priority: Major > Labels: architecture > Fix For: 2.2.0 > > > When storing the WindowedStore on the persistent KV store, we only use the > start timestamp of the window as part of the combo-key as (start-timestamp, > key). The reason that we do not add the end-timestamp as well is that we can > always calculate it from the start timestamp + window_length, and hence we > can save 8 bytes per key on the persistent KV store. > However, after reading it (via {{WindowedDeserializer}}) we do not set its end > timestamp correctly but just read it back as an {{UnlimitedWindow}}. We should fix > this by calculating its end timestamp as mentioned above. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-4468) Correctly calculate the window end timestamp after read from state stores
[ https://issues.apache.org/jira/browse/KAFKA-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736369#comment-16736369 ] Guozhang Wang commented on KAFKA-4468: -- With https://cwiki.apache.org/confluence/display/KAFKA/KIP-393%3A+Time+windowed+serde+to+properly+deserialize+changelog+input+topic merged in, we can finally resolve this issue now. > Correctly calculate the window end timestamp after read from state stores > - > > Key: KAFKA-4468 > URL: https://issues.apache.org/jira/browse/KAFKA-4468 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Guozhang Wang >Priority: Major > Labels: architecture > > When storing the WindowedStore on the persistent KV store, we only use the > start timestamp of the window as part of the combo-key as (start-timestamp, > key). The reason that we do not add the end-timestamp as well is that we can > always calculate it from the start timestamp + window_length, and hence we > can save 8 bytes per key on the persistent KV store. > However, after reading it (via {{WindowedDeserializer}}) we do not set its end > timestamp correctly but just read it back as an {{UnlimitedWindow}}. We should fix > this by calculating its end timestamp as mentioned above. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
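The calculation itself is trivial once the window size is known to the deserializer, which is what the time-windowed serde from KIP-393 supplies. A sketch with an assumed window size:

{code:java}
public class WindowEndDemo {
    public static void main(String[] args) {
        long windowStart = 1_546_300_800_000L; // start timestamp stored in the combo-key (ms)
        long windowSize  = 60_000L;            // known from the serde/store configuration
        // Instead of deserializing as an UnlimitedWindow (end = Long.MAX_VALUE),
        // derive the real end timestamp from start + window size:
        long windowEnd = windowStart + windowSize;
        System.out.println("[" + windowStart + ", " + windowEnd + ")");
    }
}
{code}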
[jira] [Resolved] (KAFKA-4468) Correctly calculate the window end timestamp after read from state stores
[ https://issues.apache.org/jira/browse/KAFKA-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guozhang Wang resolved KAFKA-4468. -- Resolution: Fixed Assignee: Richard Yu > Correctly calculate the window end timestamp after read from state stores > - > > Key: KAFKA-4468 > URL: https://issues.apache.org/jira/browse/KAFKA-4468 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Guozhang Wang >Assignee: Richard Yu >Priority: Major > Labels: architecture > > When storing the WindowedStore on the persistent KV store, we only use the > start timestamp of the window as part of the combo-key as (start-timestamp, > key). The reason that we do not add the end-timestamp as well is that we can > always calculate it from the start timestamp + window_length, and hence we > can save 8 bytes per key on the persistent KV store. > However, after reading it (via {{WindowedDeserializer}}) we do not set its end > timestamp correctly but just read it back as an {{UnlimitedWindow}}. We should fix > this by calculating its end timestamp as mentioned above. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7051) Improve the efficiency of the ReplicaManager when there are many partitions
[ https://issues.apache.org/jira/browse/KAFKA-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736367#comment-16736367 ] ASF GitHub Bot commented on KAFKA-7051: --- cmccabe commented on pull request #5206: KAFKA-7051: Improve the efficiency of ReplicaManager URL: https://github.com/apache/kafka/pull/5206 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Improve the efficiency of the ReplicaManager when there are many partitions > --- > > Key: KAFKA-7051 > URL: https://issues.apache.org/jira/browse/KAFKA-7051 > Project: Kafka > Issue Type: Bug > Components: replication >Affects Versions: 0.8.0 >Reporter: Colin P. McCabe >Assignee: Colin P. McCabe >Priority: Major > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7699) Improve wall-clock time punctuations
[ https://issues.apache.org/jira/browse/KAFKA-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736364#comment-16736364 ] Yishun Guan commented on KAFKA-7699: Hi, just curious (I am not an expert in this): why won't _ScheduledExecutorService_ work in this case? Thanks! > Improve wall-clock time punctuations > > > Key: KAFKA-7699 > URL: https://issues.apache.org/jira/browse/KAFKA-7699 > Project: Kafka > Issue Type: New Feature > Components: streams >Reporter: Matthias J. Sax >Priority: Major > Labels: needs-kip > > Currently, wall-clock time punctuations allow scheduling periodic callbacks > based on wall-clock time progress. The punctuation interval starts when the > punctuation is scheduled; thus, the firing times are non-deterministic, which is > what is desired for many use cases (I want a call-back in 5 minutes from "now"). > It would be a nice improvement to also allow users to "anchor" wall-clock > punctuations, similar to a cron job: a punctuation would then be > triggered at "fixed" times, like the beginning of the next hour, independent > of when the punctuation was registered. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (KAFKA-7699) Improve wall-clock time punctuations
[ https://issues.apache.org/jira/browse/KAFKA-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736364#comment-16736364 ] Yishun Guan edited comment on KAFKA-7699 at 1/7/19 9:16 PM: Hi, just curious (I am not an expert in this): why won't _ScheduledExecutorService_ work in this case? Thanks! was (Author: shung): Hi, just curious (I am not an expert in this), why won't _ScheduledExecutorService_ works __ in this case? thanks! > Improve wall-clock time punctuations > > > Key: KAFKA-7699 > URL: https://issues.apache.org/jira/browse/KAFKA-7699 > Project: Kafka > Issue Type: New Feature > Components: streams >Reporter: Matthias J. Sax >Priority: Major > Labels: needs-kip > > Currently, wall-clock time punctuations allow scheduling periodic callbacks > based on wall-clock time progress. The punctuation interval starts when the > punctuation is scheduled; thus, the firing times are non-deterministic, which is > what is desired for many use cases (I want a call-back in 5 minutes from "now"). > It would be a nice improvement to also allow users to "anchor" wall-clock > punctuations, similar to a cron job: a punctuation would then be > triggered at "fixed" times, like the beginning of the next hour, independent > of when the punctuation was registered. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
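A sketch of the "anchoring" arithmetic the ticket asks for (plain java.time, not a Streams API): compute the delay from now to the next fixed boundary, so the first firing lands on the hour regardless of when the punctuation was registered.

{code:java}
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

public class AnchoredPunctuationDelay {
    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.now();
        // Truncate to the current hour, then step to the next one.
        ZonedDateTime nextHour = now.truncatedTo(ChronoUnit.HOURS).plusHours(1);
        Duration initialDelay = Duration.between(now, nextHour);
        System.out.println("first punctuation in " + initialDelay + ", then every hour");
    }
}
{code}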
[jira] [Commented] (KAFKA-7768) Java doc link 404
[ https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736336#comment-16736336 ] ASF GitHub Bot commented on KAFKA-7768: --- vvcephei commented on pull request #6100: KAFKA-7768: Use absolute paths for javadoc URL: https://github.com/apache/kafka/pull/6100 The breakage observed in KAFKA-7768 occurred only when following a link from a page with a non-versioned URL. When starting from a versioned URL, the number of `../` up-paths was correct, so inserting the version actually breaks it. To avoid this situation, this PR switches to using absolute paths in conjunction with the version number, as other links in the docs do. See also #6094 ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Java doc link 404 > -- > > Key: KAFKA-7768 > URL: https://issues.apache.org/jira/browse/KAFKA-7768 > Project: Kafka > Issue Type: Bug > Components: documentation >Affects Versions: 2.1.0 >Reporter: Slim Ouertani >Priority: Critical > Fix For: 2.2.0, 2.1.1 > > > The official documentation link example > [https://kafka.apache.org/documentation/streams/developer-guide/interactive-queries.html] > (with no release reference) does not refer to a valid Java doc such as > [https://kafka.apache.org/javadoc/org/apache/kafka/streams/state/StreamsMetadata.html]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (KAFKA-7102) Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@10c8df20
[ https://issues.apache.org/jira/browse/KAFKA-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Randall Hauch updated KAFKA-7102: - Component/s: (was: KafkaConnect) core > Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@10c8df20 > --- > > Key: KAFKA-7102 > URL: https://issues.apache.org/jira/browse/KAFKA-7102 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.10.1.0 >Reporter: sankar >Priority: Critical > Attachments: kafka_java_io_exception.txt > > > We faced "Error in fetch > kafka.server.ReplicaFetcherThread$FetchRequest@10c8df20". > We have a four-node Kafka cluster in a production environment. We suddenly > experienced a Kafka connectivity issue across the cluster. A manual restart of the > Kafka service on all the nodes fixed the issue. > I attached the complete log; please check it. > Kindly let me know what more information is needed from my side. > Thanks in advance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7792) Trogdor should have an uptime function
Colin P. McCabe created KAFKA-7792: -- Summary: Trogdor should have an uptime function Key: KAFKA-7792 URL: https://issues.apache.org/jira/browse/KAFKA-7792 Project: Kafka Issue Type: Improvement Reporter: Colin P. McCabe Assignee: Colin P. McCabe Trogdor should have an uptime function which returns how long the coordinator or agent has been up. This will also be a good way to test that the daemon is running without fetching a full status. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7160) Add check for group ID length
[ https://issues.apache.org/jira/browse/KAFKA-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736238#comment-16736238 ] roo commented on KAFKA-7160: PR #6098 created, but I can't assign the issue to myself. > Add check for group ID length > - > > Key: KAFKA-7160 > URL: https://issues.apache.org/jira/browse/KAFKA-7160 > Project: Kafka > Issue Type: Improvement > Components: core >Reporter: lambdaliu >Priority: Minor > Labels: newbie > > We should limit the length of the group ID, because other systems (such as > monitoring systems) use the group ID when we run Kafka in a production > environment. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7160) Add check for group ID length
[ https://issues.apache.org/jira/browse/KAFKA-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736220#comment-16736220 ] ASF GitHub Bot commented on KAFKA-7160: --- rosama86 commented on pull request #6098: KAFKA-7160 Add check for group ID length URL: https://github.com/apache/kafka/pull/6098 Description: validate that the group id length is <= 265 bytes. Testing: added 2 test cases to validate the behavior with group-id = 265 [border] and group-id > 265. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [X] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add check for group ID length > - > > Key: KAFKA-7160 > URL: https://issues.apache.org/jira/browse/KAFKA-7160 > Project: Kafka > Issue Type: Improvement > Components: core >Reporter: lambdaliu >Priority: Minor > Labels: newbie > > We should limit the length of the group ID, because other systems (such as > monitoring systems) use the group ID when we run Kafka in a production > environment. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
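A self-contained sketch of the validation the PR describes (the 265-byte limit is taken from the PR text above; the limit actually merged may differ):

{code:java}
import java.nio.charset.StandardCharsets;

public class GroupIdValidator {
    static final int MAX_GROUP_ID_LENGTH = 265; // value quoted in the PR description

    static void validate(String groupId) {
        if (groupId.getBytes(StandardCharsets.UTF_8).length > MAX_GROUP_ID_LENGTH) {
            throw new IllegalArgumentException(
                "Group id exceeds the maximum allowed length of " + MAX_GROUP_ID_LENGTH + " bytes");
        }
    }

    public static void main(String[] args) {
        validate("my-consumer-group"); // passes
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= MAX_GROUP_ID_LENGTH; i++) {
            sb.append('x');
        }
        try {
            validate(sb.toString()); // one byte over the limit
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
{code}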
[jira] [Resolved] (KAFKA-7755) Kubernetes - Kafka clients are resolving DNS entries only one time
[ https://issues.apache.org/jira/browse/KAFKA-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-7755. --- Resolution: Fixed Fix Version/s: 2.1.1 2.2.0 > Kubernetes - Kafka clients are resolving DNS entries only one time > -- > > Key: KAFKA-7755 > URL: https://issues.apache.org/jira/browse/KAFKA-7755 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.1.0 > Environment: Kubernetes >Reporter: Loïc Monney >Priority: Blocker > Fix For: 2.2.0, 2.1.1 > > Attachments: pom.xml > > > *Introduction* > Since 2.1.0, Kafka clients support multiple DNS-resolved IP addresses > if the first one fails. This change was introduced by > https://issues.apache.org/jira/browse/KAFKA-6863. However, this DNS resolution > is now performed only once by the clients. This is not a problem if all > brokers have fixed IP addresses; however, it is definitely an issue when > Kafka brokers are run on top of Kubernetes. Indeed, new Kubernetes pods will > receive another IP address, so as soon as all brokers have been > restarted, clients won't be able to reconnect to any broker. > *Impact* > Everyone running Kafka 2.1 or later on top of Kubernetes is impacted when a > rolling restart is performed. > *Root cause* > Since https://issues.apache.org/jira/browse/KAFKA-6863, Kafka clients are > resolving DNS entries only once. > *Proposed solution* > In > [https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/ClusterConnectionStates.java#L368] > Kafka clients should perform the DNS resolution again when all IP addresses > have been "used" (when _index_ is back to 0) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
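A sketch of the proposed fix (a hypothetical class, not the actual ClusterConnectionStates code): re-resolve the hostname once every previously resolved address has been tried, instead of cycling the stale list forever.

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ReresolvingAddressCycler {
    private final String host;
    private InetAddress[] addresses;
    private int index = 0;

    ReresolvingAddressCycler(String host) throws UnknownHostException {
        this.host = host;
        this.addresses = InetAddress.getAllByName(host);
    }

    // Returns the next address to try; once the list is exhausted (index back
    // to 0), resolve again so restarted pods with new IPs are picked up.
    synchronized InetAddress next() throws UnknownHostException {
        if (index >= addresses.length) {
            addresses = InetAddress.getAllByName(host);
            index = 0;
        }
        return addresses[index++];
    }
}
{code}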
[jira] [Commented] (KAFKA-7160) Add check for group ID length
[ https://issues.apache.org/jira/browse/KAFKA-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736147#comment-16736147 ] Radwa Osama commented on KAFKA-7160: Hi all, it seems that no one is working on this one, so will look into it > Add check for group ID length > - > > Key: KAFKA-7160 > URL: https://issues.apache.org/jira/browse/KAFKA-7160 > Project: Kafka > Issue Type: Improvement > Components: core >Reporter: lambdaliu >Priority: Minor > Labels: newbie > > We should limit the length of the group ID, because other system(such as > monitor system) would use the group ID when we using kafka in production > environment. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-5061) client.id should be set for Connect producers/consumers
[ https://issues.apache.org/jira/browse/KAFKA-5061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736139#comment-16736139 ] Paul Davidson commented on KAFKA-5061: -- I have created an alternative PR after comments from [~ewencp] and [~kkonstantine] on KIP-411. This PR simply changes the default client id and avoids creating any new configuration options. See https://github.com/apache/kafka/pull/6097 > client.id should be set for Connect producers/consumers > --- > > Key: KAFKA-5061 > URL: https://issues.apache.org/jira/browse/KAFKA-5061 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 0.10.2.1 >Reporter: Ewen Cheslack-Postava >Priority: Major > Labels: needs-kip, newbie++ > > In order to properly monitor individual tasks using the producer and consumer > metrics, we need to have the framework disambiguate them. Currently when we > create producers > (https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L362) > and create consumers > (https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L371-L394) > the client ID is not being set. You can override it for the entire worker > via worker-level producer/consumer overrides, but you can't get per-task > metrics. > There are a couple of things we might want to consider doing here: > 1. Provide default client IDs based on the worker group ID + task ID > (providing uniqueness for multiple connect clusters up to the scope of the > Kafka cluster they are operating on). This seems ideal since it's a good > default; however it is a public-facing change and may need a KIP. Normally I > would be less worried about this, but some folks may be relying on picking up > metrics without this being set, in which case such a change would break their > monitoring. > 2. Allow overriding client.id on a per-connector basis. I'm not sure if this > will really be useful or not -- it lets you differentiate between metrics for > different connectors' tasks, but within a connector, all metrics would go to > a single client.id. On the other hand, this makes the tasks act as a single > group from the perspective of broker handling of client IDs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
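To make option 1 concrete, a hypothetical sketch of deriving a per-task client.id from the worker group ID and task ID (the names and format here are assumptions, not the KIP's final choice):

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.producer.ProducerConfig;

public class ConnectClientIdDemo {
    public static void main(String[] args) {
        String workerGroupId = "connect-cluster"; // assumed worker group.id
        String connectorTaskId = "my-sink-3";     // assumed connector name + task number
        Map<String, Object> producerProps = new HashMap<>();
        // Unique up to the scope of the Kafka cluster the Connect group runs on.
        producerProps.put(ProducerConfig.CLIENT_ID_CONFIG,
            workerGroupId + "-" + connectorTaskId);
        System.out.println(producerProps);
    }
}
{code}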
[jira] [Commented] (KAFKA-3184) Add Checkpoint for In-memory State Store
[ https://issues.apache.org/jira/browse/KAFKA-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736113#comment-16736113 ] Guozhang Wang commented on KAFKA-3184: -- [~Yohan123] I think you are right: currently `persistent()` is used to determine whether we should record checkpoint offsets, since for non-persistent stores there is no data flushed to persistent storage, and therefore we can only restore from the beginning every time upon resuming. With this JIRA, in-memory stores would be "persisted" to storage as well; the only difference is that the data is not written until the "snapshotting" point is reached (e.g. after N commits). Hence, in this case the persistent() flag is not that useful anyway and may be considered for deprecation. > Add Checkpoint for In-memory State Store > > > Key: KAFKA-3184 > URL: https://issues.apache.org/jira/browse/KAFKA-3184 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Guozhang Wang >Priority: Major > Labels: user-experience > > Currently Kafka Streams does not make a checkpoint of the persistent state > store upon committing, which would be expensive since it is "stopping the > world" and writes to disk: for example, RocksDB would naively require you to > copy the file directory to make a copy. > However, for in-memory stores checkpointing may be doable in an asynchronous > manner and hence can be done quickly. And the benefit of having an intermediate > checkpoint is to avoid restoring from scratch if standby tasks are not > present. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
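A toy sketch of the snapshot-every-N-commits idea (a hypothetical design, not the Streams implementation): the in-memory map only touches disk when the commit counter reaches the threshold, so a resume can start from the last snapshot instead of offset zero.

{code:java}
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class SnapshottingInMemoryStore {
    private final Map<String, String> data = new HashMap<>();
    private final Path checkpointFile = Paths.get("store.snapshot");
    private final int commitsPerSnapshot; // the "N commits" from the discussion
    private int commitsSinceSnapshot = 0;

    SnapshottingInMemoryStore(int commitsPerSnapshot) {
        this.commitsPerSnapshot = commitsPerSnapshot;
    }

    void put(String key, String value) {
        data.put(key, value);
    }

    // Cheap on most commits; writes a snapshot (plus, in a real store, the
    // corresponding checkpoint offset) only on every N-th commit.
    void commit() throws IOException {
        if (++commitsSinceSnapshot >= commitsPerSnapshot) {
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(checkpointFile))) {
                out.writeObject(new HashMap<>(data)); // snapshot a copy of the map
            }
            commitsSinceSnapshot = 0;
        }
    }
}
{code}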
[jira] [Commented] (KAFKA-7757) Too many open files after java.io.IOException: Connection to n was disconnected before the response was read
[ https://issues.apache.org/jira/browse/KAFKA-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736064#comment-16736064 ] Rajini Sivaram commented on KAFKA-7757: --- Thank you [~jnadler]! > Too many open files after java.io.IOException: Connection to n was > disconnected before the response was read > > > Key: KAFKA-7757 > URL: https://issues.apache.org/jira/browse/KAFKA-7757 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.0 >Reporter: Pedro Gontijo >Priority: Major > Attachments: Screen Shot 2019-01-03 at 12.20.38 PM.png, > fd-spike-threads.txt, kafka-allocated-file-handles.png, server.properties, > td1.txt, td2.txt, td3.txt > > > We upgraded from 0.10.2.2 to 2.1.0 (a cluster with 3 brokers) > After a while (hours) 2 brokers start to throw: > {code:java} > java.io.IOException: Connection to NN was disconnected before the response > was read > at > org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97) > at > kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97) > at > kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190) > at > kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241) > at > kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130) > at > kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129) > at scala.Option.foreach(Option.scala:257) > at > kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129) > at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > {code} > File descriptors start to pile up and if I do not restart it throws "Too many > open files" and crashes. > {code:java} > ERROR Error while accepting connection (kafka.network.Acceptor) > java.io.IOException: Too many open files in system > at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) > at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) > at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) > at kafka.network.Acceptor.accept(SocketServer.scala:460) > at kafka.network.Acceptor.run(SocketServer.scala:403) > at java.lang.Thread.run(Thread.java:748) > {code} > > After some hours the issue happens again... It has happened with all > brokers, so it is not something specific to an instance. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7768) Java doc link 404
[ https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guozhang Wang resolved KAFKA-7768. -- Resolution: Fixed Fix Version/s: 2.1.1 2.2.0 > Java doc link 404 > -- > > Key: KAFKA-7768 > URL: https://issues.apache.org/jira/browse/KAFKA-7768 > Project: Kafka > Issue Type: Bug > Components: documentation >Affects Versions: 2.1.0 >Reporter: Slim Ouertani >Priority: Critical > Fix For: 2.2.0, 2.1.1 > > > The official documentation link example > [https://kafka.apache.org/documentation/streams/developer-guide/interactive-queries.html] > (with no release reference) does not refer to a valid Java doc such as > [https://kafka.apache.org/javadoc/org/apache/kafka/streams/state/StreamsMetadata.html]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7768) Java doc link 404
[ https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736072#comment-16736072 ] ASF GitHub Bot commented on KAFKA-7768: --- guozhangwang commented on pull request #6094: KAFKA-7768: Add version to java html urls URL: https://github.com/apache/kafka/pull/6094 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Java doc link 404 > -- > > Key: KAFKA-7768 > URL: https://issues.apache.org/jira/browse/KAFKA-7768 > Project: Kafka > Issue Type: Bug > Components: documentation >Affects Versions: 2.1.0 >Reporter: Slim Ouertani >Priority: Critical > > The official documentation link example > [https://kafka.apache.org/documentation/streams/developer-guide/interactive-queries.html] > (with no release reference) does not refer to a valid Java doc such as > [https://kafka.apache.org/javadoc/org/apache/kafka/streams/state/StreamsMetadata.html]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7768) Java doc link 404
[ https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736068#comment-16736068 ] ASF GitHub Bot commented on KAFKA-7768: --- guozhangwang commented on pull request #176: KAFKA-7768: Add version to java html urls URL: https://github.com/apache/kafka-site/pull/176 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Java doc link 404 > -- > > Key: KAFKA-7768 > URL: https://issues.apache.org/jira/browse/KAFKA-7768 > Project: Kafka > Issue Type: Bug > Components: documentation >Affects Versions: 2.1.0 >Reporter: Slim Ouertani >Priority: Critical > > The official documentation link example > [https://kafka.apache.org/documentation/streams/developer-guide/interactive-queries.html] > (with no release reference) does not refer to a valid Java doc such as > [https://kafka.apache.org/javadoc/org/apache/kafka/streams/state/StreamsMetadata.html]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7743) "pure virtual method called" error message thrown for unit tests which PASS
[ https://issues.apache.org/jira/browse/KAFKA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736065#comment-16736065 ] Guozhang Wang commented on KAFKA-7743: -- I meant to have this parameter set to 1 (if I remember correctly, by default it is 4), and see if the error happens on the same unit test case. The reason is that {{shouldThrowNullPointerIfValueSerdeIsNull}} should not have a RocksDB instance at all, and that makes me think that the sysout may be garbled due to parallelism: the pure virtual function call error message may not be printed alongside the unit test in which it was captured. > "pure virtual method called" error message thrown for unit tests which PASS > --- > > Key: KAFKA-7743 > URL: https://issues.apache.org/jira/browse/KAFKA-7743 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Sarvesh Tamba >Priority: Critical > > Observing the following messages intermittently for a few random unit tests, > though the status for each of them is PASSED:- > *"pure virtual method called* > *terminate called without an active exception"* > Some of the unit tests throwing the above messages are, among others:- > org.apache.kafka.streams.kstream.internals.KTableImplTest > > shouldThrowNullPointerOnTransformValuesWithKeyWhenMaterializedIsNull PASSED > org.apache.kafka.streams.processor.internals.GlobalStreamThreadTest > > shouldThrowStreamsExceptionOnStartupIfThereIsAStreamsException PASSED > org.apache.kafka.streams.state.internals.CachingSessionStoreTest > > shouldNotForwardChangedValuesDuringFlushWhenSendOldValuesDisabled PASSED > org.apache.kafka.streams.processor.internals.StoreChangelogReaderTest > > shouldCompleteImmediatelyWhenEndOffsetIs0 PASSED > org.apache.kafka.streams.kstream.internals.KStreamKStreamJoinTest > testJoin > PASSED > org.apache.kafka.streams.kstream.internals.KTableFilterTest > > testTypeVariance PASSED > org.apache.kafka.streams.kstream.internals.KTableKTableInnerJoinTest > > testNotSendingOldValues PASSED > org.apache.kafka.streams.processor.internals.GlobalStreamThreadTest > > shouldThrowStreamsExceptionOnStartupIfThereIsAStreamsException PASSED > org.apache.kafka.streams.processor.DefaultPartitionGrouperTest > > shouldComputeGroupingForTwoGroups PASSED > org.apache.kafka.streams.state.internals.RocksDBWindowStoreTest > > shouldFetchAndIterateOverExactKeys PASSED > org.apache.kafka.streams.state.internals.FilteredCacheIteratorTest > > shouldFilterEntriesNotMatchingHasNextCondition PASSED > org.apache.kafka.streams.state.internals.GlobalStateStoreProviderTest > > shouldThrowExceptionIfStoreIsntOpen PASSED > This probably causes the 'gradle unitTest' command to fail during cleanup > with a final status of FAILED and the following message:- > "Process 'Gradle Test Executor 16' finished with non-zero exit value 134" > This intermittent/random error is not seen when the final unit test suite status > is "BUILD SUCCESSFUL". > Reproducing the "pure virtual method" issue is extremely hard, since it happens > intermittently and for a random unit test (the same unit test will not > necessarily fail next time). The ones noted above were some of the failing unit tests > observed. Note that the status next to the test shows PASSED (is this correct > or misleading?). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7791) Not log retriable exceptions as errors
Yaroslav Klymko created KAFKA-7791: -- Summary: Not log retriable exceptions as errors Key: KAFKA-7791 URL: https://issues.apache.org/jira/browse/KAFKA-7791 Project: Kafka Issue Type: Bug Components: clients Affects Versions: 2.1.0 Reporter: Yaroslav Klymko Background: I've spotted tons of Kafka-related errors in the logs; after investigation I found out that they are harmless, because the operations are retried. Hence I propose to not log retriable exceptions as errors. Examples of what I've seen in the logs: * Offset commit failed on partition .. at offset ..: The request timed out. * Offset commit failed on partition .. at offset ..: The coordinator is loading and hence can't process requests. * Offset commit failed on partition .. at offset ..: This is not the correct coordinator. * Offset commit failed on partition .. at offset ..: This server does not host this topic-partition. Here is an attempt to fix this: https://github.com/apache/kafka/pull/5904 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
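A minimal sketch of the policy the ticket (and PR 5904) argues for, keyed off Kafka's own RetriableException marker type: demote retriable failures below ERROR and keep ERROR for fatal ones.

{code:java}
import org.apache.kafka.common.errors.RetriableException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RetriableLoggingSketch {
    private static final Logger log = LoggerFactory.getLogger(RetriableLoggingSketch.class);

    static void logCommitFailure(String topicPartition, long offset, Exception e) {
        if (e instanceof RetriableException) {
            // Harmless: the commit will be retried, so do not alarm operators.
            log.debug("Offset commit failed on partition {} at offset {}: {}",
                topicPartition, offset, e.getMessage());
        } else {
            log.error("Offset commit failed on partition {} at offset {}",
                topicPartition, offset, e);
        }
    }
}
{code}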
[jira] [Commented] (KAFKA-7757) Too many open files after java.io.IOException: Connection to n was disconnected before the response was read
[ https://issues.apache.org/jira/browse/KAFKA-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736039#comment-16736039 ] Jeff Nadler commented on KAFKA-7757: [~rsivaram] please see the attached file [^fd-spike-threads.txt]; it does appear to be the exact same deadlock you identified in KAFKA-7697. > Too many open files after java.io.IOException: Connection to n was > disconnected before the response was read > > > Key: KAFKA-7757 > URL: https://issues.apache.org/jira/browse/KAFKA-7757 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.0 >Reporter: Pedro Gontijo >Priority: Major > Attachments: Screen Shot 2019-01-03 at 12.20.38 PM.png, > fd-spike-threads.txt, kafka-allocated-file-handles.png, server.properties, > td1.txt, td2.txt, td3.txt > > > We upgraded from 0.10.2.2 to 2.1.0 (a cluster with 3 brokers). > After a while (hours), 2 brokers start to throw: > {code:java} > java.io.IOException: Connection to NN was disconnected before the response > was read > at > org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97) > at > kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97) > at > kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190) > at > kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241) > at > kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130) > at > kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129) > at scala.Option.foreach(Option.scala:257) > at > kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129) > at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > {code} > File descriptors start to pile up, and if I do not restart the broker it throws "Too many > open files" and crashes. > {code:java} > ERROR Error while accepting connection (kafka.network.Acceptor) > java.io.IOException: Too many open files in system > at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) > at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) > at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) > at kafka.network.Acceptor.accept(SocketServer.scala:460) > at kafka.network.Acceptor.run(SocketServer.scala:403) > at java.lang.Thread.run(Thread.java:748) > {code} > > After some hours the issue happens again... It has happened with all > brokers, so it is not something specific to an instance. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (KAFKA-7757) Too many open files after java.io.IOException: Connection to n was disconnected before the response was read
[ https://issues.apache.org/jira/browse/KAFKA-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Nadler updated KAFKA-7757: --- Attachment: fd-spike-threads.txt > Too many open files after java.io.IOException: Connection to n was > disconnected before the response was read > > > Key: KAFKA-7757 > URL: https://issues.apache.org/jira/browse/KAFKA-7757 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.0 >Reporter: Pedro Gontijo >Priority: Major > Attachments: Screen Shot 2019-01-03 at 12.20.38 PM.png, > fd-spike-threads.txt, kafka-allocated-file-handles.png, server.properties, > td1.txt, td2.txt, td3.txt > > > We upgraded from 0.10.2.2 to 2.1.0 (a cluster with 3 brokers). > After a while (hours), 2 brokers start to throw: > {code:java} > java.io.IOException: Connection to NN was disconnected before the response > was read > at > org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97) > at > kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97) > at > kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190) > at > kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241) > at > kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130) > at > kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129) > at scala.Option.foreach(Option.scala:257) > at > kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129) > at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > {code} > File descriptors start to pile up, and if I do not restart the broker it throws "Too many > open files" and crashes. > {code:java} > ERROR Error while accepting connection (kafka.network.Acceptor) > java.io.IOException: Too many open files in system > at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) > at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) > at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) > at kafka.network.Acceptor.accept(SocketServer.scala:460) > at kafka.network.Acceptor.run(SocketServer.scala:403) > at java.lang.Thread.run(Thread.java:748) > {code} > > After some hours the issue happens again... It has happened with all > brokers, so it is not something specific to an instance. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (KAFKA-7790) Trogdor - Does not time out tasks in time
[ https://issues.apache.org/jira/browse/KAFKA-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislav Kozlovski updated KAFKA-7790: --- Description: All Trogdor task specifications have a defined `startMs` and `durationMs`. Under conditions of task failure and restarts, it is intuitive to assume that a task would not be re-run after a certain time period. Let's illustrate the issue with an example: {code:java} startMs = 12PM; durationMs = 1hour; # 12:02 - Coordinator schedules a task to run on agent-0 # 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail. # 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it re-schedules tasks that are not running in agent-0 # 13:20 - agent-0 process dies. # 13:22 - agent-0 comes back up. Coordinator re-schedules task{code} This can result in an endless loop of task rescheduling. If there are more tasks scheduled on agent-0 (e.g. a task scheduled to start each hour), we can end up in a scenario where we overwhelm the agent with tasks that we would rather have dropped. h2. Changes We propose that the Trogdor Coordinator does not re-schedule a task if the current time of re-scheduling is greater than the start time of the task and its duration combined. More specifically: {code:java} if (currentTimeMs > startTimeMs + durationTimeMs) failTask() else scheduleTask(){code} was: All Trogdor task specifications have a defined `startMs` and `durationMs`. Under conditions of task failure and restarts, it is intuitive to assume that a task would not be re-run after a certain time period. Let's illustrate the issue with an example: {code:java} startMs = 12PM; durationMs = 1hour; # 12:02 - Coordinator schedules a task to run on agent-0 # 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail. # 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it re-schedules tasks that are not running in agent-0 # 13:20 - agent-0 process dies. # 13:22 - agent-0 comes back up. Coordinator re-schedules task{code} This can result in an endless loop of task rescheduling. If there are more tasks scheduled on agent-0 (e.g. a task scheduled to start each hour), we can end up in a scenario where we overwhelm the agent with tasks that we would rather have dropped. h2. Changes We propose that the Trogdor Coordinator does not re-schedule a task if the current time of re-scheduling is greater than the start time of the task and its duration combined. More specifically: {code:java} if (currentTimeMs < startTimeMs + durationTimeMs) scheduleTask() else failTask(){code} > Trogdor - Does not time out tasks in time > - > > Key: KAFKA-7790 > URL: https://issues.apache.org/jira/browse/KAFKA-7790 > Project: Kafka > Issue Type: Improvement >Reporter: Stanislav Kozlovski >Assignee: Stanislav Kozlovski >Priority: Major > > All Trogdor task specifications have a defined `startMs` and `durationMs`. > Under conditions of task failure and restarts, it is intuitive to assume that > a task would not be re-run after a certain time period. > Let's illustrate the issue with an example: > {code:java} > startMs = 12PM; durationMs = 1hour; > # 12:02 - Coordinator schedules a task to run on agent-0 > # 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail. > # 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it > re-schedules tasks that are not running in agent-0 > # 13:20 - agent-0 process dies. > # 13:22 - agent-0 comes back up. Coordinator re-schedules task{code} > This can result in an endless loop of task rescheduling. If there are more > tasks scheduled on agent-0 (e.g. a task scheduled to start each hour), we > can end up in a scenario where we overwhelm the agent with tasks that we > would rather have dropped. > h2. Changes > We propose that the Trogdor Coordinator does not re-schedule a task if the > current time of re-scheduling is greater than the start time of the task and > its duration combined. More specifically: > {code:java} > if (currentTimeMs > startTimeMs + durationTimeMs) > failTask() > else > scheduleTask(){code} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
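For clarity, a minimal self-contained sketch of the proposed check (the interface and names are illustrative, not the actual Trogdor coordinator API): a task is re-run only while the wall clock is still inside its {{[startMs, startMs + durationMs]}} window.

{code:java}
public class TaskWindowCheck {
    // Hypothetical stand-in for the coordinator's scheduling actions.
    interface Scheduler {
        void scheduleTask();
        void failTask();
    }

    // Re-run a task only while the wall clock is still inside its
    // [startMs, startMs + durationMs] window; otherwise drop it.
    static void maybeReschedule(Scheduler scheduler, long currentTimeMs,
                                long startTimeMs, long durationTimeMs) {
        if (currentTimeMs > startTimeMs + durationTimeMs) {
            scheduler.failTask();      // window elapsed: fail instead of rescheduling
        } else {
            scheduler.scheduleTask();  // still within the task's intended lifetime
        }
    }
}
{code}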
[jira] [Updated] (KAFKA-7755) Kubernetes - Kafka clients are resolving DNS entries only one time
[ https://issues.apache.org/jira/browse/KAFKA-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Loïc Monney updated KAFKA-7755: --- Affects Version/s: (was: 2.1.1) (was: 2.2.0) > Kubernetes - Kafka clients are resolving DNS entries only one time > -- > > Key: KAFKA-7755 > URL: https://issues.apache.org/jira/browse/KAFKA-7755 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.1.0 > Environment: Kubernetes >Reporter: Loïc Monney >Priority: Blocker > Attachments: pom.xml > > > *Introduction* > Since 2.1.0, Kafka clients support multiple DNS-resolved IP addresses > and try the next one if the first fails. This change was introduced by > https://issues.apache.org/jira/browse/KAFKA-6863. However, this DNS resolution > is performed only once by the clients. This is not a problem if all > brokers have fixed IP addresses; however, it is definitely an issue when > Kafka brokers run on top of Kubernetes. Indeed, new Kubernetes pods will > receive new IP addresses, so as soon as all brokers have been > restarted, clients won't be able to reconnect to any broker. > *Impact* > Everyone running Kafka 2.1 or later on top of Kubernetes is impacted when a > rolling restart is performed. > *Root cause* > Since https://issues.apache.org/jira/browse/KAFKA-6863, Kafka clients > resolve DNS entries only once. > *Proposed solution* > In > [https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/ClusterConnectionStates.java#L368] > Kafka clients should perform the DNS resolution again when all IP addresses > have been "used" (when _index_ is back to 0) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
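For illustration, a minimal sketch of the proposed wrap-around re-resolution (a standalone helper, not the actual {{ClusterConnectionStates}} change): once every address from the previous lookup has been tried, the hostname is resolved again, which picks up the new pod IPs after a Kubernetes rolling restart.

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that cycles through resolved addresses and
// re-resolves the hostname once the index wraps around.
class ReresolvingAddressSupplier {
    private final String host;
    private List<InetAddress> addresses = Collections.emptyList();
    private int index = 0;

    ReresolvingAddressSupplier(String host) {
        this.host = host;
    }

    InetAddress next() throws UnknownHostException {
        if (index >= addresses.size()) {
            // First call, or all previously resolved addresses have been
            // tried: perform a fresh DNS lookup so new broker IPs are seen.
            addresses = Arrays.asList(InetAddress.getAllByName(host));
            index = 0;
        }
        return addresses.get(index++);
    }
}
{code}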
[jira] [Resolved] (KAFKA-7695) Cannot override StreamsPartitionAssignor in configuration
[ https://issues.apache.org/jira/browse/KAFKA-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Buykin resolved KAFKA-7695. -- Resolution: Not A Bug > Cannot override StreamsPartitionAssignor in configuration > -- > > Key: KAFKA-7695 > URL: https://issues.apache.org/jira/browse/KAFKA-7695 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.0.0, 2.0.1 >Reporter: Dmitry Buykin >Priority: Major > Labels: configuration > > Cannot override StreamsPartitionAssignor by changing the property > partition.assignment.strategy in KStreams 2.0.1, because Streams crashes > inside KafkaClientSupplier.getGlobalConsumer. This global consumer > works only with RangeAssignor, which is configured by default. > It can be reproduced by setting > `props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, > StreamsPartitionAssignor.class.getName());` > To me it looks like a bug. > I opened a discussion here: > https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1543395977453700 > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
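The resolution reflects that this is working as designed: Kafka Streams installs {{StreamsPartitionAssignor}} on its main consumer internally, so the assignor must not be overridden in user configuration. For context, a minimal sketch of the intended usage (application id, bootstrap address, and topic names are placeholders):

{code:java}
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class AssignorConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        // Do NOT set ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG:
        // Streams configures its own assignor on the main consumer, and the
        // global consumer expects the default assignor.

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");                    // placeholder topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
{code}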
[jira] [Commented] (KAFKA-7789) SSL-related unit tests hang when run on Fedora 29
[ https://issues.apache.org/jira/browse/KAFKA-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735827#comment-16735827 ] ASF GitHub Bot commented on KAFKA-7789: --- tombentley commented on pull request #6096: KAFKA-7789: Increase size of RSA keys used by TestSslUtils URL: https://github.com/apache/kafka/pull/6096 Fix [KAFKA-7789](https://issues.apache.org/jira/browse/KAFKA-7789) by increasing the key size for the RSA keys generated for the tests. ### Committer Checklist (excluded from commit message) - [x] Verify design and implementation - [x] Verify test coverage and CI build status - [x] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > SSL-related unit tests hang when run on Fedora 29 > - > > Key: KAFKA-7789 > URL: https://issues.apache.org/jira/browse/KAFKA-7789 > Project: Kafka > Issue Type: Bug >Reporter: Tom Bentley >Assignee: Tom Bentley >Priority: Minor > > Various SSL-related unit tests (such as {{SslSelectorTest}}) hang when > executed on Fedora 29. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7695) Cannot override StreamsPartitionAssignor in configuration
[ https://issues.apache.org/jira/browse/KAFKA-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735816#comment-16735816 ] Dmitry Buykin commented on KAFKA-7695: -- Hi [~vvcephei], yes, I agree that this task should be closed because it has become too broad. Regards, Dmitry. > Cannot override StreamsPartitionAssignor in configuration > -- > > Key: KAFKA-7695 > URL: https://issues.apache.org/jira/browse/KAFKA-7695 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.0.0, 2.0.1 >Reporter: Dmitry Buykin >Priority: Major > Labels: configuration > > Cannot override StreamsPartitionAssignor by changing the property > partition.assignment.strategy in KStreams 2.0.1, because Streams crashes > inside KafkaClientSupplier.getGlobalConsumer. This global consumer > works only with RangeAssignor, which is configured by default. > It can be reproduced by setting > `props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, > StreamsPartitionAssignor.class.getName());` > To me it looks like a bug. > I opened a discussion here: > https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1543395977453700 > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7755) Kubernetes - Kafka clients are resolving DNS entries only one time
[ https://issues.apache.org/jira/browse/KAFKA-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735806#comment-16735806 ] ASF GitHub Bot commented on KAFKA-7755: --- rajinisivaram commented on pull request #6049: KAFKA-7755 Turn update inet addresses URL: https://github.com/apache/kafka/pull/6049 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Kubernetes - Kafka clients are resolving DNS entries only one time > -- > > Key: KAFKA-7755 > URL: https://issues.apache.org/jira/browse/KAFKA-7755 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.1.0, 2.2.0, 2.1.1 > Environment: Kubernetes >Reporter: Loïc Monney >Priority: Blocker > Attachments: pom.xml > > > *Introduction* > Since 2.1.0, Kafka clients support multiple DNS-resolved IP addresses > and try the next one if the first fails. This change was introduced by > https://issues.apache.org/jira/browse/KAFKA-6863. However, this DNS resolution > is performed only once by the clients. This is not a problem if all > brokers have fixed IP addresses; however, it is definitely an issue when > Kafka brokers run on top of Kubernetes. Indeed, new Kubernetes pods will > receive new IP addresses, so as soon as all brokers have been > restarted, clients won't be able to reconnect to any broker. > *Impact* > Everyone running Kafka 2.1 or later on top of Kubernetes is impacted when a > rolling restart is performed. > *Root cause* > Since https://issues.apache.org/jira/browse/KAFKA-6863, Kafka clients > resolve DNS entries only once. > *Proposed solution* > In > [https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/ClusterConnectionStates.java#L368] > Kafka clients should perform the DNS resolution again when all IP addresses > have been "used" (when _index_ is back to 0) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7790) Trogdor - Does not time out tasks in time
Stanislav Kozlovski created KAFKA-7790: -- Summary: Trogdor - Does not time out tasks in time Key: KAFKA-7790 URL: https://issues.apache.org/jira/browse/KAFKA-7790 Project: Kafka Issue Type: Improvement Reporter: Stanislav Kozlovski Assignee: Stanislav Kozlovski All Trogdor task specifications have a defined `startMs` and `durationMs`. Under conditions of task failure and restarts, it is intuitive to assume that a task would not be re-run after a certain time period. Let's illustrate the issue with an example: {code:java} startMs = 12PM; durationMs = 1hour; # 12:02 - Coordinator schedules a task to run on agent-0 # 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail. # 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it re-schedules tasks that are not running in agent-0 # 13:20 - agent-0 process dies. # 13:22 - agent-0 comes back up. Coordinator re-schedules task{code} This can result in an endless loop of task rescheduling. If there are more tasks scheduled on agent-0 (e.g. a task scheduled to start each hour), we can end up in a scenario where we overwhelm the agent with tasks that we would rather have dropped. h2. Changes We propose that the Trogdor Coordinator does not re-schedule a task if the current time of re-scheduling is greater than the start time of the task and its duration combined. More specifically: {code:java} if (currentTimeMs < startTimeMs + durationTimeMs) scheduleTask() else failTask(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7361) Kafka wont reconnect after NoRouteToHostException
[ https://issues.apache.org/jira/browse/KAFKA-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735755#comment-16735755 ] Igor Soarez commented on KAFKA-7361: https://issues.apache.org/jira/browse/KAFKA-7755 seems to be addressing this. > Kafka wont reconnect after NoRouteToHostException > - > > Key: KAFKA-7361 > URL: https://issues.apache.org/jira/browse/KAFKA-7361 > Project: Kafka > Issue Type: Bug > Components: zkclient >Affects Versions: 1.1.0 > Environment: kubernetes cluster >Reporter: C Schabert >Priority: Major > > After ZooKeeper died and came back up, Kafka could not reconnect to ZooKeeper. > In this setup, ZooKeeper is behind a DNS name and came up with a different IP. > > Here is the Kafka log output: > > {code:java} > [2018-08-30 14:50:23,846] INFO Opening socket connection to server > zookeeper-0.zookeeper./10.42.0.123:2181. Will not attempt to authenticate > using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) > 8/30/2018 4:50:26 PM [2018-08-30 14:50:26,916] WARN Session 0x1658b2f0f4e0002 > for server null, unexpected error, closing socket connection and attempting > reconnect (org.apache.zookeeper.ClientCnxn) > 8/30/2018 4:50:26 PM java.net.NoRouteToHostException: No route to host > 8/30/2018 4:50:26 PM at sun.nio.ch.SocketChannelImpl.checkConnect(Native > Method) > 8/30/2018 4:50:26 PM at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > 8/30/2018 4:50:26 PM at > org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > 8/30/2018 4:50:26 PM at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7789) SSL-related unit tests hang when run on Fedora 29
[ https://issues.apache.org/jira/browse/KAFKA-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735729#comment-16735729 ] Tom Bentley commented on KAFKA-7789: This is caused by Fedora tightening up its system-wide crypto policies, as described here: https://fedoraproject.org/wiki/Changes/StrongCryptoSettings2. Their changes to {{/etc/crypto-policies/back-ends/java.config}} set {{jdk.certpath.disabledAlgorithms=MD2, MD5, DSA, RSA keySize < 2048}}, thus causing the KeyManager to reject RSA keys smaller than 2048 bits. The rejection of the keys happens silently unless the {{-Djavax.net.debug=ssl,handshake,keymanager}} system property is set. {{TestSslUtils}} generates its keys at 1024 bits. Fedora 29 users can change the policy to LEGACY with {{update-crypto-policies --set LEGACY}} as root, but this enables LEGACY algorithm support system-wide. The better option would be to update the unit tests to use 2048-bit keys. > SSL-related unit tests hang when run on Fedora 29 > - > > Key: KAFKA-7789 > URL: https://issues.apache.org/jira/browse/KAFKA-7789 > Project: Kafka > Issue Type: Bug >Reporter: Tom Bentley >Assignee: Tom Bentley >Priority: Minor > > Various SSL-related unit tests (such as {{SslSelectorTest}}) hang when > executed on Fedora 29. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
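For illustration, the fix amounts to generating larger keys in the test utilities; a minimal sketch (a standalone example, not the actual {{TestSslUtils}} code):

{code:java}
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.NoSuchAlgorithmException;

public class RsaKeyGenExample {
    public static KeyPair generateKeyPair() throws NoSuchAlgorithmException {
        KeyPairGenerator keyGen = KeyPairGenerator.getInstance("RSA");
        // 2048 bits instead of 1024, so the keys pass policies that disable
        // "RSA keySize < 2048" (such as Fedora 29's default crypto policy).
        keyGen.initialize(2048);
        return keyGen.generateKeyPair();
    }
}
{code}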
[jira] [Created] (KAFKA-7789) SSL-related unit tests hang when run on Fedora 29
Tom Bentley created KAFKA-7789: -- Summary: SSL-related unit tests hang when run on Fedora 29 Key: KAFKA-7789 URL: https://issues.apache.org/jira/browse/KAFKA-7789 Project: Kafka Issue Type: Bug Reporter: Tom Bentley Assignee: Tom Bentley Various SSL-related unit tests (such as {{SslSelectorTest}}) hang when executed on Fedora 29. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7757) Too many open files after java.io.IOException: Connection to n was disconnected before the response was read
[ https://issues.apache.org/jira/browse/KAFKA-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735600#comment-16735600 ] Rajini Sivaram commented on KAFKA-7757: --- This could be due to KAFKA-7697. Can you get a thread dump of a broker while the file descriptors are going up and attach it to the JIRA? Thank you! > Too many open files after java.io.IOException: Connection to n was > disconnected before the response was read > > > Key: KAFKA-7757 > URL: https://issues.apache.org/jira/browse/KAFKA-7757 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.0 >Reporter: Pedro Gontijo >Priority: Major > Attachments: Screen Shot 2019-01-03 at 12.20.38 PM.png, > kafka-allocated-file-handles.png, server.properties, td1.txt, td2.txt, td3.txt > > > We upgraded from 0.10.2.2 to 2.1.0 (a cluster with 3 brokers). > After a while (hours), 2 brokers start to throw: > {code:java} > java.io.IOException: Connection to NN was disconnected before the response > was read > at > org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97) > at > kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97) > at > kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190) > at > kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241) > at > kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130) > at > kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129) > at scala.Option.foreach(Option.scala:257) > at > kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129) > at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > {code} > File descriptors start to pile up, and if I do not restart the broker it throws "Too many > open files" and crashes. > {code:java} > ERROR Error while accepting connection (kafka.network.Acceptor) > java.io.IOException: Too many open files in system > at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) > at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) > at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) > at kafka.network.Acceptor.accept(SocketServer.scala:460) > at kafka.network.Acceptor.run(SocketServer.scala:403) > at java.lang.Thread.run(Thread.java:748) > {code} > > After some hours the issue happens again... It has happened with all > brokers, so it is not something specific to an instance. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
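For reference, one common way to capture such a dump (assuming a JDK is available on the broker host; {{<pid>}} stands for the broker's process id) is {{jstack <pid> > td.txt}}, run a few times several seconds apart while the descriptor count is climbing.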