[jira] [Commented] (KAFKA-7768) Java doc link 404

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736529#comment-16736529
 ] 

ASF GitHub Bot commented on KAFKA-7768:
---

guozhangwang commented on pull request #178: KAFKA-7768: Use absolute paths for 
javadoc
URL: https://github.com/apache/kafka-site/pull/178
 
 
   
 



> Java doc link 404 
> --
>
> Key: KAFKA-7768
> URL: https://issues.apache.org/jira/browse/KAFKA-7768
> Project: Kafka
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.1.0
>Reporter: Slim Ouertani
>Priority: Critical
> Fix For: 2.2.0, 2.1.1
>
>
> The official documentation page 
> [https://kafka.apache.org/documentation/streams/developer-guide/interactive-queries.html]
>  (whose URL has no release reference) links to Javadoc URLs that do not 
> resolve, such as 
> [https://kafka.apache.org/javadoc/org/apache/kafka/streams/state/StreamsMetadata.html].





[jira] [Commented] (KAFKA-7768) Java doc link 404

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736423#comment-16736423
 ] 

ASF GitHub Bot commented on KAFKA-7768:
---

vvcephei commented on pull request #178: KAFKA-7768: Use absolute paths for 
javadoc
URL: https://github.com/apache/kafka-site/pull/178
 
 
   The breakage observed in KAFKA-7768 occurred only when following a link from 
a page with a non-versioned URL.
   
   When starting from a versioned URL, the number of ../ up-paths was correct, 
so inserting the version actually breaks it. To avoid this situation, this PR 
switches to using absolute paths in conjunction with the version number, as 
other links in the docs do.
   
   See also #176, https://github.com/apache/kafka/pull/6094, and 
https://github.com/apache/kafka/pull/6100
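
   To make the path arithmetic concrete, a sketch of the resolution 
(representative paths, assuming the versioned docs live under /21/ as they do 
for 2.1; the exact number of up-levels varies by page):

{code}
relative link on the docs page:
  ../../../javadoc/org/apache/kafka/streams/state/StreamsMetadata.html

from /21/documentation/streams/developer-guide/interactive-queries.html
  -> /21/javadoc/org/apache/kafka/streams/state/StreamsMetadata.html   (valid)
from /documentation/streams/developer-guide/interactive-queries.html
  -> /javadoc/org/apache/kafka/streams/state/StreamsMetadata.html      (404)

an absolute link with the version, e.g.
  /21/javadoc/org/apache/kafka/streams/state/StreamsMetadata.html
resolves identically from both pages.
{code}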
 





[jira] [Commented] (KAFKA-7786) Fast update of leader epoch may stall partition fetching due to FENCED_LEADER_EPOCH

2019-01-07 Thread Anna Povzner (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736420#comment-16736420
 ] 

Anna Povzner commented on KAFKA-7786:
-

Hi [~junrao], thanks for the comment. I recently talked to [~hachikuji] and he 
suggested the same fix, with no protocol change. I just opened a PR: 
[https://github.com/apache/kafka/pull/6101]

> Fast update of leader epoch may stall partition fetching due to 
> FENCED_LEADER_EPOCH
> ---
>
> Key: KAFKA-7786
> URL: https://issues.apache.org/jira/browse/KAFKA-7786
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Anna Povzner
>Assignee: Anna Povzner
>Priority: Critical
>
> KIP-320 (KAFKA-7395) added a FENCED_LEADER_EPOCH error response to an 
> OffsetsForLeaderEpoch request if the epoch in the request is lower than the 
> broker's leader epoch. ReplicaFetcherThread builds an OffsetsForLeaderEpoch 
> request under _partitionMapLock_, sends the request outside the lock, and 
> then processes the response under _partitionMapLock_. The broker may receive 
> LeaderAndIsr with the same leader but the next leader epoch, and remove and 
> re-add the partition to the fetcher thread (with partition state reflecting 
> the updated leader epoch), all while the OffsetsForLeaderEpoch request (with 
> the old leader epoch) is still outstanding and waiting for the lock to 
> process the OffsetsForLeaderEpoch response. As a result, the partition gets 
> removed from partitionStates and this broker will not fetch for this 
> partition until the next LeaderAndIsr, which may take a while. We will see a 
> log message like this:
> [2018-12-23 07:23:04,802] INFO [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Partition test_topic-17 has an older epoch (7) than the current 
> leader. Will await the new LeaderAndIsr state before resuming fetching. 
> (kafka.server.ReplicaFetcherThread)
> We saw this happen with 
> kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.reassign_from_offset_zero=True.
>  This test does partition re-assignment while bouncing 2 out of 4 total 
> brokers. When the failure happened, each bounced broker was also a 
> controller. Because of the re-assignment, the controller updates the leader 
> epoch without updating the leader on controller change or on broker startup, 
> so we see several leader epoch changes without a leader change, which 
> increases the likelihood of the race condition described above.
> Here are the exact events that happened in this test (around the failure):
> We have 4 brokers: 1, 2, 3, 4. Partition re-assignment is started for 
> test_topic-17 [2, 4, 1] --> [3, 1, 2]. At time t0, the leader of 
> test_topic-17 is broker 2.
>  # clean shutdown of broker 3, which is also a controller
>  # broker 4 becomes controller, continues the re-assignment, and updates the 
> leader epoch for test_topic-17 to 6 (with the same leader)
>  # broker 2 (leader of test_topic-17) receives the new leader epoch: 
> “test_topic-17 starts at Leader Epoch 6 from offset 1388. Previous Leader 
> Epoch was: 5”
>  # broker 3 is started again after the clean shutdown
>  # the controller sees broker 3's startup, and sends LeaderAndIsr(leader 
> epoch 6) to broker 3
>  # the controller updates the leader epoch to 7
>  # broker 2 (leader of test_topic-17) receives LeaderAndIsr for leader epoch 
> 7: “test_topic-17 starts at Leader Epoch 7 from offset 1974. Previous Leader 
> Epoch was: 6”
>  # broker 3 receives LeaderAndIsr for test_topic-17 and leader epoch 6 from 
> the controller: “Added fetcher to broker BrokerEndPoint(id=2) for leader 
> epoch 6” and sends an OffsetsForLeaderEpoch request to broker 2
>  # broker 3 receives LeaderAndIsr for test_topic-17 and leader epoch 7 from 
> the controller; it removes and re-adds the fetcher thread, executing 
> AbstractFetcherThread.addPartitions(), which updates the partition state 
> with leader epoch 7
>  # broker 3 receives FENCED_LEADER_EPOCH in response to 
> OffsetsForLeaderEpoch(leader epoch 6), because the leader received 
> LeaderAndIsr for leader epoch 7 before it got OffsetsForLeaderEpoch(leader 
> epoch 6) from broker 3. As a result, it removes the partition from 
> partitionStates and does not fetch until the controller updates the leader 
> epoch and sends LeaderAndIsr for this partition to broker 3. The test fails 
> because the re-assignment does not finish on time (due to broker 3 not 
> fetching).
>  
> One way to address this is to add more state to PartitionFetchState; 
> however, that may introduce other race conditions. A cleaner way, I think, 
> is to return the leader epoch in the OffsetsForLeaderEpoch response with the 
> FENCED_LEADER_EPOCH error, and then ignore the error if the partition state 
> contains a higher leader epoch. The adv

[jira] [Commented] (KAFKA-7786) Fast update of leader epoch may stall partition fetching due to FENCED_LEADER_EPOCH

2019-01-07 Thread Jun Rao (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736415#comment-16736415
 ] 

Jun Rao commented on KAFKA-7786:


Great find, Anna. About the fix: I am not sure that we need the protocol 
change. We know the leader epoch used in the OffsetsForLeaderEpoch request. We 
can just pass that along with the OffsetsForLeaderEpoch response to 
maybeTruncate(). If the leaderEpoch in partitionStates has changed, we simply 
ignore the response and retry the OffsetsForLeaderEpoch request with the new 
leader epoch.
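
A minimal sketch of that idea (hypothetical Java names; the actual logic lives 
in Kafka's Scala fetcher threads): remember the epoch each request was built 
with, and drop the response if the partition's epoch moved on while the 
request was in flight.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class EpochFencedResponseHandler {
    // Current fetch epoch per partition (bumped on each LeaderAndIsr).
    private final Map<String, Integer> partitionEpochs = new ConcurrentHashMap<>();

    /** Returns true if the OffsetsForLeaderEpoch response should be applied. */
    boolean shouldApply(String partition, int requestEpoch) {
        Integer current = partitionEpochs.get(partition);
        // Partition removed, or a newer epoch arrived while the request was in
        // flight: ignore the (possibly FENCED_LEADER_EPOCH) response and let
        // the next fetch round retry with the new epoch.
        return current != null && current == requestEpoch;
    }

    void onLeaderAndIsr(String partition, int newEpoch) {
        partitionEpochs.put(partition, newEpoch);
    }
}
{code}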

 


[jira] [Commented] (KAFKA-7786) Fast update of leader epoch may stall partition fetching due to FENCED_LEADER_EPOCH

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736416#comment-16736416
 ] 

ASF GitHub Bot commented on KAFKA-7786:
---

apovzner commented on pull request #6101: KAFKA-7786: Ignore 
OffsetsForLeaderEpoch response if leader epoch changed while request in flight
URL: https://github.com/apache/kafka/pull/6101
 
 
   There is a race condition in ReplicaFetcherThread where we can update 
PartitionFetchState with the new leader epoch (same leader) before handling 
the OffsetsForLeaderEpoch response with the FENCED_LEADER_EPOCH error. This 
causes the partition to be removed from partitionStates, which in turn stops 
fetching until the next LeaderAndIsr. 
   
   Our system test 
kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.reassign_from_offset_zero=True
 failed 3 times due to this error in the last couple of months. Since this 
test already exercises this condition, I am not adding any more tests.
   
   Also added a toString() implementation to PartitionData, because some log 
messages did not show useful info; I found this while investigating the above 
system test failure.
   
   cc @hachikuji who suggested the fix.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 




[jira] [Commented] (KAFKA-6460) Add mocks for state stores used in Streams unit testing

2019-01-07 Thread Guozhang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736406#comment-16736406
 ] 

Guozhang Wang commented on KAFKA-6460:
--

Great! Feel free to look at KIP-267 for the context of this: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-267%3A+Add+Processor+Unit+Test+Support+to+Kafka+Streams+Test+Utils

> Add mocks for state stores used in Streams unit testing
> ---
>
> Key: KAFKA-6460
> URL: https://issues.apache.org/jira/browse/KAFKA-6460
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams, unit tests
>Reporter: Guozhang Wang
>Priority: Major
>  Labels: newbie++
>
> We'd like to use mocks for different types of state stores (kv, window, and 
> session) that can record the number of expected put / get calls, to be used 
> in DSL operator unit testing. This involves implementing the two interfaces 
> {{StoreSupplier}} and {{StoreBuilder}} so that they can return an object 
> created from, say, EasyMock, and the object can then be set up with the 
> expected calls.
> In addition, we should also add a mock record collector, which can be 
> returned from the mock processor context, so that for a store with logging 
> enabled users can also validate that the changes have been forwarded to the 
> changelog as well.





[jira] [Commented] (KAFKA-6460) Add mocks for state stores used in Streams unit testing

2019-01-07 Thread Yishun Guan (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736402#comment-16736402
 ] 

Yishun Guan commented on KAFKA-6460:


Hi [~guozhang], I am still interested if nobody else is looking into this. It 
will take some time since I am not familiar with the test packages, but I can 
definitely do some research and take a stab at it. 



[jira] [Created] (KAFKA-7793) Improve the Trogdor command-line

2019-01-07 Thread Colin P. McCabe (JIRA)
Colin P. McCabe created KAFKA-7793:
--

 Summary: Improve the Trogdor command-line
 Key: KAFKA-7793
 URL: https://issues.apache.org/jira/browse/KAFKA-7793
 Project: Kafka
  Issue Type: Improvement
Reporter: Colin P. McCabe


Improve the Trogdor command-line.  It should be easier to launch tasks from a 
task spec in a file.  It should be easier to list the currently-running tasks 
in a readable way.  We should be able to filter the currently-running tasks.





[jira] [Commented] (KAFKA-6460) Add mocks for state stores used in Streams unit testing

2019-01-07 Thread Guozhang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736399#comment-16736399
 ] 

Guozhang Wang commented on KAFKA-6460:
--

It should be part of the public test-utils package, and this ticket also 
involves: 1) adding a {{StoreTestDriver}} which can provide mock 
{{StoreSupplier}}s and {{StoreBuilder}}s, which in turn can provide different 
mocks of {{StateStore}}s; 2) these state stores should use the 
{{MockProcessorContext}} that already exists in test-utils; and 3) refactoring 
the internal unit tests of Streams to leverage the newly added state store / 
supplier / record collector and remove the existing ad-hoc ones (e.g. 
{{KeyValueStoreTestDriver}}).

[~shung] Are you interested in picking this up?
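
A minimal sketch of the mocking style this describes, using EasyMock against 
the public {{KeyValueStore}} interface directly (illustrative only; the 
proposed {{StoreTestDriver}} would wrap this behind {{StoreSupplier}} / 
{{StoreBuilder}} mocks):

{code:java}
import static org.easymock.EasyMock.createMock;
import static org.easymock.EasyMock.expect;
import static org.easymock.EasyMock.replay;
import static org.easymock.EasyMock.verify;

import org.apache.kafka.streams.state.KeyValueStore;

public class MockKeyValueStoreExample {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        KeyValueStore<String, Long> store = createMock(KeyValueStore.class);

        // Record phase: expect exactly one put and one get from the
        // operator under test.
        store.put("key", 1L);
        expect(store.get("key")).andReturn(1L);
        replay(store);

        // ... here the DSL operator under test would use `store`;
        // we simulate its calls directly for the sketch ...
        store.put("key", 1L);
        store.get("key");

        verify(store); // fails if the expected calls did not happen
    }
}
{code}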



[jira] [Resolved] (KAFKA-4218) Enable access to key in ValueTransformer

2019-01-07 Thread Guozhang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang resolved KAFKA-4218.
--
   Resolution: Fixed
Fix Version/s: (was: 1.1.0)
   2.1.0

This has been resolved by 
https://github.com/apache/kafka/commit/bcc712b45820da74b44209857ebbf7b9d59e0ed7 
from [~jeyhunkarimov]


> Enable access to key in ValueTransformer
> 
>
> Key: KAFKA-4218
> URL: https://issues.apache.org/jira/browse/KAFKA-4218
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 0.10.0.1
>Reporter: Elias Levy
>Assignee: Jeyhun Karimov
>Priority: Major
>  Labels: api, kip
> Fix For: 2.1.0
>
>
> While transforming values via {{KStream.transformValues}} and 
> {{ValueTransformer}}, the key associated with the value may be needed, even 
> if it is not changed.  For instance, it may be used to access stores.  
> As of now, the key is not available within these methods and interfaces, 
> leading to the use of {{KStream.transform}} and {{Transformer}}, and the 
> unnecessary creation of new {{KeyValue}} objects.
> KIP-149: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-149%3A+Enabling+key+access+in+ValueTransformer%2C+ValueMapper%2C+and+ValueJoiner
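
A minimal sketch of the API shape this KIP enabled (the types and names below 
are illustrative):

{code:java}
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;

// The key is read-only: it can be used, e.g., to access stores, but since it
// cannot change, no repartitioning is triggered and no new KeyValue is
// allocated.
class KeyAwareTransformer implements ValueTransformerWithKey<String, String, String> {
    @Override
    public void init(final ProcessorContext context) { }

    @Override
    public String transform(final String readOnlyKey, final String value) {
        return readOnlyKey + ":" + value.toUpperCase();
    }

    @Override
    public void close() { }
}
// Usage: stream.transformValues(KeyAwareTransformer::new);
{code}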





[jira] [Commented] (KAFKA-7768) Java doc link 404

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736382#comment-16736382
 ] 

ASF GitHub Bot commented on KAFKA-7768:
---

guozhangwang commented on pull request #6100: KAFKA-7768: Use absolute paths 
for javadoc
URL: https://github.com/apache/kafka/pull/6100
 
 
   
 





[jira] [Assigned] (KAFKA-3729) Auto-configure non-default SerDes passed alongside the topology builder

2019-01-07 Thread Guozhang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang reassigned KAFKA-3729:


Assignee: (was: Bharat Viswanadham)

>  Auto-configure non-default SerDes passed alongside the topology builder
> 
>
> Key: KAFKA-3729
> URL: https://issues.apache.org/jira/browse/KAFKA-3729
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Fred Patton
>Priority: Major
>  Labels: api, newbie
>
> From Guozhang Wang:
> "Only default serdes provided through configs are auto-configured today. But 
> we could auto-configure other serdes passed alongside the topology builder as 
> well."





[jira] [Updated] (KAFKA-4468) Correctly calculate the window end timestamp after read from state stores

2019-01-07 Thread Guozhang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-4468:
-
Fix Version/s: 2.2.0

> Correctly calculate the window end timestamp after read from state stores
> -
>
> Key: KAFKA-4468
> URL: https://issues.apache.org/jira/browse/KAFKA-4468
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Guozhang Wang
>Assignee: Richard Yu
>Priority: Major
>  Labels: architecture
> Fix For: 2.2.0
>
>
> When storing a WindowedStore on the persistent KV store, we only use the 
> start timestamp of the window as part of the combo-key, as (start-timestamp, 
> key). The reason that we do not add the end timestamp as well is that we can 
> always calculate it as the start timestamp + window_length, and hence we can 
> save 8 bytes per key on the persistent KV store.
> However, after reading it back (via {{WindowedDeserializer}}) we do not set 
> its end timestamp correctly but just read it as an {{UnlimitedWindow}}. We 
> should fix this by calculating its end timestamp as mentioned above.
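
A minimal sketch of the reconstruction described above (using the internal 
{{TimeWindow}} class purely for illustration):

{code:java}
import org.apache.kafka.streams.kstream.Windowed;
import org.apache.kafka.streams.kstream.internals.TimeWindow;

public class WindowEndExample {
    // The serialized combo-key stores only the window start, so on
    // deserialization the end is reconstructed as start + windowSize
    // instead of defaulting to an "unlimited" end. In practice the
    // window size would come from the serde configuration.
    static Windowed<String> fromStorage(String key, long startMs, long windowSizeMs) {
        return new Windowed<>(key, new TimeWindow(startMs, startMs + windowSizeMs));
    }

    public static void main(String[] args) {
        Windowed<String> w = fromStorage("user-1", 1_546_300_800_000L, 60_000L);
        System.out.println(w.window().start() + " .. " + w.window().end());
    }
}
{code}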





[jira] [Commented] (KAFKA-4468) Correctly calculate the window end timestamp after read from state stores

2019-01-07 Thread Guozhang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736369#comment-16736369
 ] 

Guozhang Wang commented on KAFKA-4468:
--

With 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-393%3A+Time+windowed+serde+to+properly+deserialize+changelog+input+topic
 merged, we can finally resolve this issue.



[jira] [Resolved] (KAFKA-4468) Correctly calculate the window end timestamp after read from state stores

2019-01-07 Thread Guozhang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang resolved KAFKA-4468.
--
Resolution: Fixed
  Assignee: Richard Yu



[jira] [Commented] (KAFKA-7051) Improve the efficiency of the ReplicaManager when there are many partitions

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736367#comment-16736367
 ] 

ASF GitHub Bot commented on KAFKA-7051:
---

cmccabe commented on pull request #5206: KAFKA-7051: Improve the efficiency of 
ReplicaManager
URL: https://github.com/apache/kafka/pull/5206
 
 
   
 



> Improve the efficiency of the ReplicaManager when there are many partitions
> ---
>
> Key: KAFKA-7051
> URL: https://issues.apache.org/jira/browse/KAFKA-7051
> Project: Kafka
>  Issue Type: Bug
>  Components: replication
>Affects Versions: 0.8.0
>Reporter: Colin P. McCabe
>Assignee: Colin P. McCabe
>Priority: Major
> Fix For: 2.2.0
>
>






[jira] [Commented] (KAFKA-7699) Improve wall-clock time punctuations

2019-01-07 Thread Yishun Guan (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736364#comment-16736364
 ] 

Yishun Guan commented on KAFKA-7699:


Hi, just curious (I am not an expert in this): why wouldn't 
_ScheduledExecutorService_ work in this case? Thanks!

> Improve wall-clock time punctuations
> 
>
> Key: KAFKA-7699
> URL: https://issues.apache.org/jira/browse/KAFKA-7699
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Matthias J. Sax
>Priority: Major
>  Labels: needs-kip
>
> Currently, wall-clock time punctuations allow scheduling periodic callbacks 
> based on wall-clock time progress. The punctuation interval starts when the 
> punctuation is scheduled; thus it's non-deterministic, which is what is 
> desired for many use cases (I want a callback 5 minutes from "now").
> It would be a nice improvement to allow users to "anchor" wall-clock 
> punctuation, too, similar to a cron job: a punctuation would then be 
> triggered at "fixed" times, like the beginning of the next hour, independent 
> of when the punctuation was registered.
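
For comparison, a minimal sketch of such "anchored" scheduling with plain JDK 
APIs. This bypasses the Streams punctuation/threading model (callbacks must 
run on the stream thread), which is presumably why a first-class API change is 
being discussed rather than a plain executor:

{code:java}
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AnchoredPunctuationSketch {
    public static void main(String[] args) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        ZonedDateTime now = ZonedDateTime.now();
        // Anchor to the next hour boundary, then fire hourly thereafter,
        // independent of when this code ran.
        long initialDelayMs = Duration.between(
            now, now.truncatedTo(ChronoUnit.HOURS).plusHours(1)).toMillis();
        executor.scheduleAtFixedRate(
            () -> System.out.println("anchored punctuation at " + ZonedDateTime.now()),
            initialDelayMs, TimeUnit.HOURS.toMillis(1), TimeUnit.MILLISECONDS);
    }
}
{code}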





[jira] [Comment Edited] (KAFKA-7699) Improve wall-clock time punctuations

2019-01-07 Thread Yishun Guan (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736364#comment-16736364
 ] 

Yishun Guan edited comment on KAFKA-7699 at 1/7/19 9:16 PM:


Hi, just curious (I am not an expert in this): why wouldn't 
_ScheduledExecutorService_ work in this case? Thanks!


was (Author: shung):
Hi, just curious (I am not an expert in this), why won't 
_ScheduledExecutorService_ works __ in this case? thanks!



[jira] [Commented] (KAFKA-7768) Java doc link 404

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736336#comment-16736336
 ] 

ASF GitHub Bot commented on KAFKA-7768:
---

vvcephei commented on pull request #6100: KAFKA-7768: Use absolute paths for 
javadoc
URL: https://github.com/apache/kafka/pull/6100
 
 
   The breakage observed in KAFKA-7768 occurred only when following a link from 
a page with a non-versioned URL.
   
   When starting from a versioned URL, the number of `../` up-paths was 
correct, so inserting the version actually breaks it. To avoid this situation, 
this PR switches to using absolute paths in conjunction with the version 
number, as other links in the docs do.
   
   See also #6094
   
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 





[jira] [Updated] (KAFKA-7102) Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@10c8df20

2019-01-07 Thread Randall Hauch (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Hauch updated KAFKA-7102:
-
Component/s: (was: KafkaConnect)
 core

>  Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@10c8df20
> ---
>
> Key: KAFKA-7102
> URL: https://issues.apache.org/jira/browse/KAFKA-7102
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
>Reporter: sankar
>Priority: Critical
> Attachments: kafka_java_io_exception.txt
>
>
> We faced: Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@10c8df20
> We have a four-node Kafka cluster in a production environment. We suddenly 
> experienced a Kafka connectivity issue across the cluster. A manual restart 
> of the Kafka service on all the nodes fixed the issue.
> I attached the complete log; please check it and let me know what more 
> information is needed from my side.
> Thanks in advance.





[jira] [Created] (KAFKA-7792) Trogdor should have an uptime function

2019-01-07 Thread Colin P. McCabe (JIRA)
Colin P. McCabe created KAFKA-7792:
--

 Summary: Trogdor should have an uptime function
 Key: KAFKA-7792
 URL: https://issues.apache.org/jira/browse/KAFKA-7792
 Project: Kafka
  Issue Type: Improvement
Reporter: Colin P. McCabe
Assignee: Colin P. McCabe


Trogdor should have an uptime function which returns how long the coordinator 
or agent has been up.  This will also be a good way to test that the daemon is 
running without fetching a full status.





[jira] [Commented] (KAFKA-7160) Add check for group ID length

2019-01-07 Thread roo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736238#comment-16736238
 ] 

roo commented on KAFKA-7160:


PR created: #6098, but I can't assign the issue to myself. 

> Add check for group ID length
> -
>
> Key: KAFKA-7160
> URL: https://issues.apache.org/jira/browse/KAFKA-7160
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Reporter: lambdaliu
>Priority: Minor
>  Labels: newbie
>
> We should limit the length of the group ID, because other systems (such as 
> monitoring systems) use the group ID when we run Kafka in a production 
> environment.





[jira] [Commented] (KAFKA-7160) Add check for group ID length

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736220#comment-16736220
 ] 

ASF GitHub Bot commented on KAFKA-7160:
---

rosama86 commented on pull request #6098: KAFKA-7160 Add check for group ID 
length
URL: https://github.com/apache/kafka/pull/6098
 
 
   Description: validates that the group ID length is <= 265 bytes
   
   Testing: added 2 test cases to validate behavior with group ID length = 265 
(the border case) and > 265
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [X] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
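
   A minimal sketch of this kind of check (the 265-byte limit is taken from 
the PR description above; the class and placement are illustrative, not the 
PR's actual code):

{code:java}
import java.nio.charset.StandardCharsets;

public class GroupIdValidator {
    static final int MAX_GROUP_ID_LENGTH = 265; // limit per the PR description

    static void validate(String groupId) {
        if (groupId.getBytes(StandardCharsets.UTF_8).length > MAX_GROUP_ID_LENGTH) {
            throw new IllegalArgumentException(
                "Group id '" + groupId + "' exceeds " + MAX_GROUP_ID_LENGTH + " bytes");
        }
    }

    public static void main(String[] args) {
        validate("my-consumer-group"); // passes; longer IDs would throw
    }
}
{code}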
   
 





[jira] [Resolved] (KAFKA-7755) Kubernetes - Kafka clients are resolving DNS entries only one time

2019-01-07 Thread Rajini Sivaram (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-7755.
---
   Resolution: Fixed
Fix Version/s: 2.1.1
   2.2.0

> Kubernetes - Kafka clients are resolving DNS entries only one time
> --
>
> Key: KAFKA-7755
> URL: https://issues.apache.org/jira/browse/KAFKA-7755
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.1.0
> Environment: Kubernetes
>Reporter: Loïc Monney
>Priority: Blocker
> Fix For: 2.2.0, 2.1.1
>
> Attachments: pom.xml
>
>
> *Introduction*
>  Since 2.1.0, Kafka clients support multiple DNS-resolved IP addresses if 
> the first one fails. This change was introduced by 
> https://issues.apache.org/jira/browse/KAFKA-6863. However, this DNS 
> resolution is now performed only once by the clients. This is not a problem 
> if all brokers have fixed IP addresses, but it is definitely an issue when 
> Kafka brokers run on top of Kubernetes: new Kubernetes pods receive new IP 
> addresses, so as soon as all brokers have been restarted, clients won't be 
> able to reconnect to any broker.
> *Impact*
>  Everyone running Kafka 2.1 or later on top of Kubernetes is impacted when a 
> rolling restart is performed.
> *Root cause*
>  Since https://issues.apache.org/jira/browse/KAFKA-6863, Kafka clients 
> resolve DNS entries only once.
> *Proposed solution*
>  In 
> [https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/ClusterConnectionStates.java#L368]
>  Kafka clients should perform the DNS resolution again when all IP addresses 
> have been "used" (when _index_ is back to 0).
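
A minimal sketch of the proposed behavior (names are illustrative, not the 
actual ClusterConnectionStates fields):

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;

class ReresolvingAddressIterator {
    private final String host;
    private InetAddress[] addresses;
    private int index = 0;

    ReresolvingAddressIterator(String host) throws UnknownHostException {
        this.host = host;
        this.addresses = InetAddress.getAllByName(host);
    }

    InetAddress next() throws UnknownHostException {
        if (index >= addresses.length) {
            // All resolved IPs have been used: resolve again instead of
            // looping over stale addresses (the fix proposed in this ticket,
            // so restarted Kubernetes pods with new IPs are picked up).
            addresses = InetAddress.getAllByName(host);
            index = 0;
        }
        return addresses[index++];
    }
}
{code}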





[jira] [Commented] (KAFKA-7160) Add check for group ID length

2019-01-07 Thread Radwa Osama (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736147#comment-16736147
 ] 

Radwa Osama commented on KAFKA-7160:


Hi all, it seems that no one is working on this one, so I will look into it.



[jira] [Commented] (KAFKA-5061) client.id should be set for Connect producers/consumers

2019-01-07 Thread Paul Davidson (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736139#comment-16736139
 ] 

Paul Davidson commented on KAFKA-5061:
--

I have created an alternative PR after comments from [~ewencp] and 
[~kkonstantine] on KIP-411. This PR simply changes the default client ID and 
avoids creating any new configuration options. See 
https://github.com/apache/kafka/pull/6097

> client.id should be set for Connect producers/consumers
> ---
>
> Key: KAFKA-5061
> URL: https://issues.apache.org/jira/browse/KAFKA-5061
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 0.10.2.1
>Reporter: Ewen Cheslack-Postava
>Priority: Major
>  Labels: needs-kip, newbie++
>
> In order to properly monitor individual tasks using the producer and consumer 
> metrics, we need to have the framework disambiguate them. Currently when we 
> create producers 
> (https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L362)
>  and create consumers 
> (https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L371-L394)
>  the client ID is not being set. You can override it for the entire worker 
> via worker-level producer/consumer overrides, but you can't get per-task 
> metrics.
> There are a couple of things we might want to consider doing here:
> 1. Provide default client IDs based on the worker group ID + task ID 
> (providing uniqueness for multiple connect clusters up to the scope of the 
> Kafka cluster they are operating on). This seems ideal since it's a good 
> default; however it is a public-facing change and may need a KIP. Normally I 
> would be less worried about this, but some folks may be relying on picking up 
> metrics without this being set, in which case such a change would break their 
> monitoring.
> 2. Allow overriding client.id on a per-connector basis. I'm not sure if this 
> will really be useful or not -- it lets you differentiate between metrics for 
> different connectors' tasks, but within a connector, all metrics would go to 
> a single client.id. On the other hand, this makes the tasks act as a single 
> group from the perspective of broker handling of client IDs.
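
A minimal sketch of option 1 (the config keys below are real producer config 
keys, but the "connect-..." ID format shown is illustrative, not necessarily 
what PR #6097 chose):

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ConnectClientIdExample {
    static Properties producerProps(String workerGroupId, String connector, int task) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // A per-task default client.id disambiguates producer metrics:
        // unique up to the scope of the Kafka cluster being written to.
        props.put(ProducerConfig.CLIENT_ID_CONFIG,
                  "connect-" + workerGroupId + "-" + connector + "-" + task);
        return props;
    }
}
{code}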





[jira] [Commented] (KAFKA-3184) Add Checkpoint for In-memory State Store

2019-01-07 Thread Guozhang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736113#comment-16736113
 ] 

Guozhang Wang commented on KAFKA-3184:
--

[~Yohan123] I think you are right: currently `persistent()` is used to 
determine whether we should record checkpoint offsets, since for non-persistent 
stores no data is flushed to persistent storage and therefore we can only 
restore from the beginning every time upon resuming. With this JIRA, in-memory 
stores would be "persisted" to storage as well; the only difference is that 
nothing is written until a "snapshotting" point is reached (e.g. after N 
commits). Hence the persistent() flag is not that useful anymore, and we may 
consider deprecating it.
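
A minimal sketch of the decision being discussed (illustrative; the real logic 
lives in Streams' internal state manager):

{code:java}
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.streams.processor.StateStore;

class CheckpointSketch {
    // Offsets that would be written to the checkpoint file on commit.
    static Map<String, Long> checkpointableOffsets(Map<StateStore, Long> offsets) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<StateStore, Long> e : offsets.entrySet()) {
            if (e.getKey().persistent()) { // in-memory stores are skipped today
                result.put(e.getKey().name(), e.getValue());
            }
        }
        return result;
    }
}
{code}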

> Add Checkpoint for In-memory State Store
> 
>
> Key: KAFKA-3184
> URL: https://issues.apache.org/jira/browse/KAFKA-3184
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Guozhang Wang
>Priority: Major
>  Labels: user-experience
>
> Currently Kafka Streams does not make a checkpoint of the persistent state 
> store upon committing, which would be expensive since it "stops the world" 
> and writes to disk: for example, RocksDB would naively require you to copy 
> the file directory to make a copy. 
> However, for in-memory stores checkpointing may be doable in an asynchronous 
> manner, hence it can be done quickly. The benefit of having an intermediate 
> checkpoint is avoiding a restore from scratch if standby tasks are not 
> present.





[jira] [Commented] (KAFKA-7757) Too many open files after java.io.IOException: Connection to n was disconnected before the response was read

2019-01-07 Thread Rajini Sivaram (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736064#comment-16736064
 ] 

Rajini Sivaram commented on KAFKA-7757:
---

Thank you [~jnadler]!

> Too many open files after java.io.IOException: Connection to n was 
> disconnected before the response was read
> 
>
> Key: KAFKA-7757
> URL: https://issues.apache.org/jira/browse/KAFKA-7757
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.0
>Reporter: Pedro Gontijo
>Priority: Major
> Attachments: Screen Shot 2019-01-03 at 12.20.38 PM.png, 
> fd-spike-threads.txt, kafka-allocated-file-handles.png, server.properties, 
> td1.txt, td2.txt, td3.txt
>
>
> We upgraded from 0.10.2.2 to 2.1.0 (a cluster with 3 brokers)
> After a while (hours) 2 brokers start to throw:
> {code:java}
> java.io.IOException: Connection to NN was disconnected before the response 
> was read
> at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
> at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
> at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
> at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {code}
> File descriptors start to pile up, and if I do not restart, it throws "Too 
> many open files" and crashes.  
> {code:java}
> ERROR Error while accepting connection (kafka.network.Acceptor)
> java.io.IOException: Too many open files in system
> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> at kafka.network.Acceptor.accept(SocketServer.scala:460)
> at kafka.network.Acceptor.run(SocketServer.scala:403)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  After some hours the issue happens again... It has happened with all 
> brokers, so it is not something specific to an instance.
>  





[jira] [Resolved] (KAFKA-7768) Java doc link 404

2019-01-07 Thread Guozhang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang resolved KAFKA-7768.
--
   Resolution: Fixed
Fix Version/s: 2.1.1
   2.2.0



[jira] [Commented] (KAFKA-7768) Java doc link 404

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736072#comment-16736072
 ] 

ASF GitHub Bot commented on KAFKA-7768:
---

guozhangwang commented on pull request #6094:  KAFKA-7768: Add version to java 
html urls
URL: https://github.com/apache/kafka/pull/6094
 
 
   
 





[jira] [Commented] (KAFKA-7768) Java doc link 404

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736068#comment-16736068
 ] 

ASF GitHub Bot commented on KAFKA-7768:
---

guozhangwang commented on pull request #176: KAFKA-7768: Add version to java 
html urls
URL: https://github.com/apache/kafka-site/pull/176
 
 
   
 





[jira] [Commented] (KAFKA-7743) "pure virtual method called" error message thrown for unit tests which PASS

2019-01-07 Thread Guozhang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736065#comment-16736065
 ] 

Guozhang Wang commented on KAFKA-7743:
--

I meant to have this parameter set to 1 (if I remember correctly, it is 4 by 
default), and see if the error happens on the same unit test case. The reason 
is that {{shouldThrowNullPointerIfValueSerdeIsNull}} should not have a RocksDB 
instance at all, which makes me think that the sysout may be garbled due to 
parallelism: the virtual function call error message may not be co-printed 
with the unit test in which it was captured.

> "pure virtual method called" error message thrown for unit tests which PASS
> ---
>
> Key: KAFKA-7743
> URL: https://issues.apache.org/jira/browse/KAFKA-7743
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Sarvesh Tamba
>Priority: Critical
>
> Observing the following messages intermittently for a few random unit tests, 
> though the status for each of them is PASSED:
> *"pure virtual method called*
> *terminate called without an active exception"*
> Some of the unit tests throwing above messages are, besides others:-
> org.apache.kafka.streams.kstream.internals.KTableImplTest > 
> shouldThrowNullPointerOnTransformValuesWithKeyWhenMaterializedIsNull PASSED
> org.apache.kafka.streams.processor.internals.GlobalStreamThreadTest > 
> shouldThrowStreamsExceptionOnStartupIfThereIsAStreamsException PASSED
> org.apache.kafka.streams.state.internals.CachingSessionStoreTest > 
> shouldNotForwardChangedValuesDuringFlushWhenSendOldValuesDisabled PASSED
> org.apache.kafka.streams.processor.internals.StoreChangelogReaderTest > 
> shouldCompleteImmediatelyWhenEndOffsetIs0 PASSED
> org.apache.kafka.streams.kstream.internals.KStreamKStreamJoinTest > testJoin 
> PASSED
> org.apache.kafka.streams.kstream.internals.KTableFilterTest > 
> testTypeVariance PASSED
> org.apache.kafka.streams.kstream.internals.KTableKTableInnerJoinTest > 
> testNotSendingOldValues PASSED
> org.apache.kafka.streams.processor.internals.GlobalStreamThreadTest > 
> shouldThrowStreamsExceptionOnStartupIfThereIsAStreamsException PASSED
> org.apache.kafka.streams.processor.DefaultPartitionGrouperTest > 
> shouldComputeGroupingForTwoGroups PASSED
> org.apache.kafka.streams.state.internals.RocksDBWindowStoreTest > 
> shouldFetchAndIterateOverExactKeys PASSED
> org.apache.kafka.streams.state.internals.FilteredCacheIteratorTest > 
> shouldFilterEntriesNotMatchingHasNextCondition PASSED
> org.apache.kafka.streams.state.internals.GlobalStateStoreProviderTest > 
> shouldThrowExceptionIfStoreIsntOpen PASSED
> This probably causes the 'gradle unitTest' command to fail at cleanup time 
> with a final status of FAILED and the following message:
> "Process 'Gradle Test Executor 16' finished with non-zero exit value 134"
> This intermittent/random error is not seen when final unit test suite status 
> is "BUILD SUCCESSFUL".
> Reproducing "pure virtual method" issue is extremely hard, since it happens 
> intermittently and for any random unit test(not the same unit test will fail 
> next time). The ones noted above were some of the failing unit tests 
> observed. Note that the status next to the test shows PASSED(is this correct 
> or misleading?).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7791) Not log retriable exceptions as errors

2019-01-07 Thread Yaroslav Klymko (JIRA)
Yaroslav Klymko created KAFKA-7791:
--

 Summary: Not log retriable exceptions as errors
 Key: KAFKA-7791
 URL: https://issues.apache.org/jira/browse/KAFKA-7791
 Project: Kafka
  Issue Type: Bug
  Components: clients
Affects Versions: 2.1.0
Reporter: Yaroslav Klymko


Background: I've spotted tons of Kafka-related errors in logs; after 
investigation I found that they are harmless because the operations are retried.
Hence I propose not logging retriable exceptions as errors.

Examples of what I've seen in the logs:
 * Offset commit failed on partition .. at offset ..: The request timed out.
 * Offset commit failed on partition .. at offset ..: The coordinator is 
loading and hence can't process requests.
 * Offset commit failed on partition .. at offset ..: This is not the correct 
coordinator.
 * Offset commit failed on partition .. at offset ..: This server does not host 
this topic-partition.

 

Here is an attempt to fix this: https://github.com/apache/kafka/pull/5904
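
A rough sketch of the proposed behaviour (illustrative only, not the code in 
that PR; the class and method names here are hypothetical): demote retriable 
commit failures to WARN and keep ERROR for fatal ones.

{code:java}
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.RetriableException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper, not actual consumer code: branch the log level on
// whether the failure is retriable.
class OffsetCommitLogging {
    private static final Logger log = LoggerFactory.getLogger(OffsetCommitLogging.class);

    static void logCommitFailure(TopicPartition partition, long offset, Exception error) {
        if (error instanceof RetriableException) {
            // The commit will be retried, so this is not an application-level error.
            log.warn("Offset commit failed on partition {} at offset {}: {} (will be retried)",
                    partition, offset, error.getMessage());
        } else {
            log.error("Offset commit failed on partition {} at offset {}", partition, offset, error);
        }
    }
}
{code}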



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7757) Too many open files after java.io.IOException: Connection to n was disconnected before the response was read

2019-01-07 Thread Jeff Nadler (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736039#comment-16736039
 ] 

Jeff Nadler commented on KAFKA-7757:


[~rsivaram] please see the attached file [^fd-spike-threads.txt]; it does appear 
to be the exact same deadlock you identified in KAFKA-7697.

> Too many open files after java.io.IOException: Connection to n was 
> disconnected before the response was read
> 
>
> Key: KAFKA-7757
> URL: https://issues.apache.org/jira/browse/KAFKA-7757
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.0
>Reporter: Pedro Gontijo
>Priority: Major
> Attachments: Screen Shot 2019-01-03 at 12.20.38 PM.png, 
> fd-spike-threads.txt, kafka-allocated-file-handles.png, server.properties, 
> td1.txt, td2.txt, td3.txt
>
>
> We upgraded from 0.10.2.2 to 2.1.0 (a cluster with 3 brokers).
> After a while (hours), 2 brokers start to throw:
> {code:java}
> java.io.IOException: Connection to NN was disconnected before the response 
> was read
> at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
> at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
> at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
> at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {code}
> File descriptors start to pile up, and if I do not restart the broker it 
> throws "Too many open files" and crashes.
> {code:java}
> ERROR Error while accepting connection (kafka.network.Acceptor)
> java.io.IOException: Too many open files in system
> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> at kafka.network.Acceptor.accept(SocketServer.scala:460)
> at kafka.network.Acceptor.run(SocketServer.scala:403)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  After some hours the issue happens again... It has happened with all 
> brokers, so it is not something specific to an instance.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7757) Too many open files after java.io.IOException: Connection to n was disconnected before the response was read

2019-01-07 Thread Jeff Nadler (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Nadler updated KAFKA-7757:
---
Attachment: fd-spike-threads.txt

> Too many open files after java.io.IOException: Connection to n was 
> disconnected before the response was read
> 
>
> Key: KAFKA-7757
> URL: https://issues.apache.org/jira/browse/KAFKA-7757
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.0
>Reporter: Pedro Gontijo
>Priority: Major
> Attachments: Screen Shot 2019-01-03 at 12.20.38 PM.png, 
> fd-spike-threads.txt, kafka-allocated-file-handles.png, server.properties, 
> td1.txt, td2.txt, td3.txt
>
>
> We upgraded from 0.10.2.2 to 2.1.0 (a cluster with 3 brokers).
> After a while (hours), 2 brokers start to throw:
> {code:java}
> java.io.IOException: Connection to NN was disconnected before the response 
> was read
> at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
> at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
> at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
> at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {code}
> File descriptors start to pile up, and if I do not restart the broker it 
> throws "Too many open files" and crashes.
> {code:java}
> ERROR Error while accepting connection (kafka.network.Acceptor)
> java.io.IOException: Too many open files in system
> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> at kafka.network.Acceptor.accept(SocketServer.scala:460)
> at kafka.network.Acceptor.run(SocketServer.scala:403)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  After some hours the issue happens again... It has happened with all 
> brokers, so it is not something specific to an instance.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7790) Trogdor - Does not time out tasks in time

2019-01-07 Thread Stanislav Kozlovski (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Kozlovski updated KAFKA-7790:
---
Description: 
All Trogdor task specifications have a defined `startMs` and `durationMs`. 
Under conditions of task failure and restarts, it is intuitive to assume that a 
task would not be re-run after a certain time period.

Let's illustrate the issue with an example:
{code:java}
startMs = 12PM; durationMs = 1hour;
# 12:02 - Coordinator schedules a task to run on agent-0
# 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
# 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it 
re-schedules tasks that are not running in agent-0
# 13:20 - agent-0 process dies.
# 13:22 - agent-0 comes back up. Coordinator re-schedules task{code}
This can result in an endless loop of task rescheduling. If there are more 
tasks scheduled on agent-0 (e.g. a task scheduled to start each hour), we can 
end up overwhelming the agent with tasks that we would rather have dropped.
h2. Changes

We propose that the Trogdor Coordinator does not re-schedule a task if the 
current time of re-scheduling is greater than the start time of the task and 
its duration combined. More specifically:
{code:java}
if (currentTimeMs > startTimeMs + durationTimeMs)
  failTask()
else
  scheduleTask(){code}

  was:
All Trogdor task specifications have a defined `startMs` and `durationMs`. 
Under conditions of task failure and restarts, it is intuitive to assume that a 
task would not be re-run after a certain time period.

Let's illustrate the issue with an example:
{code:java}
startMs = 12PM; durationMs = 1hour;
# 12:02 - Coordinator schedules a task to run on agent-0
# 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
# 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it 
re-schedules tasks that are not running in agent-0
# 13:20 - agent-0 process dies.
# 13:22 - agent-0 comes back up. Coordinator re-schedules task{code}
This can result in an endless loop of task rescheduling. If there are more 
tasks scheduled on agent-0 (e.g. a task scheduled to start each hour), we can 
end up overwhelming the agent with tasks that we would rather have dropped.
h2. Changes


We propose that the Trogdor Coordinator does not re-schedule a task if the 
current time of re-scheduling is greater than the start time of the task and 
its duration combined. More specifically:
{code:java}
if (currentTimeMs < startTimeMs + durationTimeMs)
  scheduleTask()
else
  failTask(){code}


> Trogdor - Does not time out tasks in time
> -
>
> Key: KAFKA-7790
> URL: https://issues.apache.org/jira/browse/KAFKA-7790
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Stanislav Kozlovski
>Assignee: Stanislav Kozlovski
>Priority: Major
>
> All Trogdor task specifications have a defined `startMs` and `durationMs`. 
> Under conditions of task failure and restarts, it is intuitive to assume that 
> a task would not be re-run after a certain time period.
> Let's illustrate the issue with an example:
> {code:java}
> startMs = 12PM; durationMs = 1hour;
> # 12:02 - Coordinator schedules a task to run on agent-0
> # 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
> # 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it 
> re-schedules tasks that are not running in agent-0
> # 13:20 - agent-0 process dies.
> # 13:22 - agent-0 comes back up. Coordinator re-schedules task{code}
> This can result in an endless loop of task rescheduling. If there are more 
> tasks scheduled on agent-0 (e.g. a task scheduled to start each hour), we can 
> end up overwhelming the agent with tasks that we would rather have dropped.
> h2. Changes
> We propose that the Trogdor Coordinator does not re-schedule a task if the 
> current time of re-scheduling is greater than the start time of the task and 
> its duration combined. More specifically:
> {code:java}
> if (currentTimeMs > startTimeMs + durationTimeMs)
>   failTask()
> else
>   scheduleTask(){code}
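
A minimal, self-contained Java sketch of that check (the names here are 
hypothetical, not the actual Trogdor coordinator code):

{code:java}
// Hypothetical sketch: a task may only be re-scheduled while the current time
// is still inside its [startMs, startMs + durationMs] window; afterwards it is
// failed instead of re-run.
class TaskRescheduleCheck {
    static boolean shouldReschedule(long currentTimeMs, long startTimeMs, long durationTimeMs) {
        return currentTimeMs <= startTimeMs + durationTimeMs;
    }

    public static void main(String[] args) {
        long startMs = 12 * 60 * 60 * 1000L;                      // 12:00, from the example above
        long durationMs = 60 * 60 * 1000L;                        // one hour
        long restartMs = 13 * 60 * 60 * 1000L + 22 * 60 * 1000L;  // agent restart at 13:22
        System.out.println(shouldReschedule(restartMs, startMs, durationMs)); // false -> failTask()
    }
}
{code}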



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7755) Kubernetes - Kafka clients are resolving DNS entries only one time

2019-01-07 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/KAFKA-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Loïc Monney updated KAFKA-7755:
---
Affects Version/s: (was: 2.1.1)
   (was: 2.2.0)

> Kubernetes - Kafka clients are resolving DNS entries only one time
> --
>
> Key: KAFKA-7755
> URL: https://issues.apache.org/jira/browse/KAFKA-7755
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.1.0
> Environment: Kubernetes
>Reporter: Loïc Monney
>Priority: Blocker
> Attachments: pom.xml
>
>
> *Introduction*
>  Since 2.1.0, Kafka clients support multiple DNS-resolved IP addresses if the 
> first one fails. This change was introduced by 
> https://issues.apache.org/jira/browse/KAFKA-6863. However, this DNS resolution 
> is now performed only once by the clients. This is not a problem if all 
> brokers have fixed IP addresses, but it is definitely an issue when Kafka 
> brokers run on top of Kubernetes. New Kubernetes pods receive new IP 
> addresses, so as soon as all brokers have been restarted, clients won't be 
> able to reconnect to any broker.
> *Impact*
>  Everyone running Kafka 2.1 or later on top of Kubernetes is impacted when a 
> rolling restart is performed.
> *Root cause*
>  Since https://issues.apache.org/jira/browse/KAFKA-6863 Kafka clients are 
> resolving DNS entries only once.
> *Proposed solution*
>  In 
> [https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/ClusterConnectionStates.java#L368]
>  Kafka clients should perform the DNS resolution again when all IP addresses 
> have been "used" (when _index_ is back to 0)
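
A minimal, self-contained sketch of the proposed behaviour (a hypothetical 
helper class, not the actual ClusterConnectionStates code): once every resolved 
address has been tried, resolve the hostname again instead of cycling over the 
stale list.

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the proposal, not Kafka client code.
class ReResolvingAddressIterator {
    private final String host;
    private List<InetAddress> addresses;
    private int index = 0;

    ReResolvingAddressIterator(String host) {
        this.host = host;
    }

    InetAddress next() throws UnknownHostException {
        if (index == 0) {
            // Re-resolve so brokers that restarted with new IPs (e.g. Kubernetes
            // pods) are picked up instead of reusing addresses cached earlier.
            addresses = Arrays.asList(InetAddress.getAllByName(host));
        }
        InetAddress address = addresses.get(index);
        index = (index + 1) % addresses.size();
        return address;
    }
}
{code}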



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-7695) Cannot override StreamsPartitionAssignor in configuration

2019-01-07 Thread Dmitry Buykin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Buykin resolved KAFKA-7695.
--
Resolution: Not A Bug

> Cannot override StreamsPartitionAssignor in configuration 
> --
>
> Key: KAFKA-7695
> URL: https://issues.apache.org/jira/browse/KAFKA-7695
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Dmitry Buykin
>Priority: Major
>  Labels: configuration
>
> Cannot override StreamsPartitionAssignor by changing the property 
> partition.assignment.strategy in KStreams 2.0.1, because the streams crash 
> inside KafkaClientSupplier.getGlobalConsumer. This GlobalConsumer works only 
> with RangeAssignor, which is configured by default.
> This can be reproduced by setting
> `props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, 
> StreamsPartitionAssignor.class.getName());`
> For me it looks like a bug.
> Opened a discussion here 
> https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1543395977453700
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7789) SSL-related unit tests hang when run on Fedora 29

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735827#comment-16735827
 ] 

ASF GitHub Bot commented on KAFKA-7789:
---

tombentley commented on pull request #6096: KAFKA-7789: Increase size of RSA 
keys used by TestSslUtils
URL: https://github.com/apache/kafka/pull/6096
 
 
   Fix [KAFKA-7789](https://issues.apache.org/jira/browse/KAFKA-7789) by 
increasing the key size for the RSA keys generated for the tests. 
   
   ### Committer Checklist (excluded from commit message)
   - [x] Verify design and implementation 
   - [x] Verify test coverage and CI build status
   - [x] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> SSL-related unit tests hang when run on Fedora 29
> -
>
> Key: KAFKA-7789
> URL: https://issues.apache.org/jira/browse/KAFKA-7789
> Project: Kafka
>  Issue Type: Bug
>Reporter: Tom Bentley
>Assignee: Tom Bentley
>Priority: Minor
>
> Various SSL-related unit tests (such as {{SslSelectorTest}}) hang when 
> executed on Fedora 29. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7695) Cannot override StreamsPartitionAssignor in configuration

2019-01-07 Thread Dmitry Buykin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735816#comment-16735816
 ] 

Dmitry Buykin commented on KAFKA-7695:
--

Hi [~vvcephei],

Yes, I agree that this task should be closed because it has become too broad.

Regards,

Dmitry.

> Cannot override StreamsPartitionAssignor in configuration 
> --
>
> Key: KAFKA-7695
> URL: https://issues.apache.org/jira/browse/KAFKA-7695
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Dmitry Buykin
>Priority: Major
>  Labels: configuration
>
> Cannot override StreamsPartitionAssignor by changing the property 
> partition.assignment.strategy in KStreams 2.0.1, because the streams crash 
> inside KafkaClientSupplier.getGlobalConsumer. This GlobalConsumer works only 
> with RangeAssignor, which is configured by default.
> This can be reproduced by setting
> `props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, 
> StreamsPartitionAssignor.class.getName());`
> For me it looks like a bug.
> Opened a discussion here 
> https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1543395977453700
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7755) Kubernetes - Kafka clients are resolving DNS entries only one time

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735806#comment-16735806
 ] 

ASF GitHub Bot commented on KAFKA-7755:
---

rajinisivaram commented on pull request #6049: KAFKA-7755 Turn update inet 
addresses
URL: https://github.com/apache/kafka/pull/6049
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Kubernetes - Kafka clients are resolving DNS entries only one time
> --
>
> Key: KAFKA-7755
> URL: https://issues.apache.org/jira/browse/KAFKA-7755
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.1.0, 2.2.0, 2.1.1
> Environment: Kubernetes
>Reporter: Loïc Monney
>Priority: Blocker
> Attachments: pom.xml
>
>
> *Introduction*
>  Since 2.1.0, Kafka clients support multiple DNS-resolved IP addresses if the 
> first one fails. This change was introduced by 
> https://issues.apache.org/jira/browse/KAFKA-6863. However, this DNS resolution 
> is now performed only once by the clients. This is not a problem if all 
> brokers have fixed IP addresses, but it is definitely an issue when Kafka 
> brokers run on top of Kubernetes. New Kubernetes pods receive new IP 
> addresses, so as soon as all brokers have been restarted, clients won't be 
> able to reconnect to any broker.
> *Impact*
>  Everyone running Kafka 2.1 or later on top of Kubernetes is impacted when a 
> rolling restart is performed.
> *Root cause*
>  Since https://issues.apache.org/jira/browse/KAFKA-6863 Kafka clients are 
> resolving DNS entries only once.
> *Proposed solution*
>  In 
> [https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/ClusterConnectionStates.java#L368]
>  Kafka clients should perform the DNS resolution again when all IP addresses 
> have been "used" (when _index_ is back to 0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7790) Trogdor - Does not time out tasks in time

2019-01-07 Thread Stanislav Kozlovski (JIRA)
Stanislav Kozlovski created KAFKA-7790:
--

 Summary: Trogdor - Does not time out tasks in time
 Key: KAFKA-7790
 URL: https://issues.apache.org/jira/browse/KAFKA-7790
 Project: Kafka
  Issue Type: Improvement
Reporter: Stanislav Kozlovski
Assignee: Stanislav Kozlovski


All Trogdor task specifications have a defined `startMs` and `durationMs`. 
Under conditions of task failure and restarts, it is intuitive to assume that a 
task would not be re-run after a certain time period.

Let's illustrate the issue with an example:
{code:java}
startMs = 12PM; durationMs = 1hour;
# 12:02 - Coordinator schedules a task to run on agent-0
# 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
# 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it 
re-schedules tasks that are not running in agent-0
# 13:20 - agent-0 process dies.
# 13:22 - agent-0 comes back up. Coordinator re-schedules task{code}
This can result in an endless loop of task rescheduling. If there are more 
tasks scheduled on agent-0 (e.g. a task scheduled to start each hour), we can 
end up overwhelming the agent with tasks that we would rather have dropped.
h2. Changes


We propose that the Trogdor Coordinator does not re-schedule a task if the 
current time of re-scheduling is greater than the start time of the task and 
its duration combined. More specifically:
{code:java}
if (currentTimeMs < startTimeMs + durationTimeMs)
  scheduleTask()
else
  failTask(){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7361) Kafka wont reconnect after NoRouteToHostException

2019-01-07 Thread Igor Soarez (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735755#comment-16735755
 ] 

Igor Soarez commented on KAFKA-7361:


https://issues.apache.org/jira/browse/KAFKA-7755 seems to be addressing this.

> Kafka wont reconnect after NoRouteToHostException
> -
>
> Key: KAFKA-7361
> URL: https://issues.apache.org/jira/browse/KAFKA-7361
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 1.1.0
> Environment: kubernetes cluster
>Reporter: C Schabert
>Priority: Major
>
> After ZooKeeper died and came back up, Kafka could not reconnect to it.
> In this setup, ZooKeeper is behind a DNS name and came back up with a 
> different IP.
>  
> Here is the kafka log output:
>  
> {code:java}
> [2018-08-30 14:50:23,846] INFO Opening socket connection to server 
> zookeeper-0.zookeeper./10.42.0.123:2181. Will not attempt to authenticate 
> using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
> 8/30/2018 4:50:26 PM [2018-08-30 14:50:26,916] WARN Session 0x1658b2f0f4e0002 
> for server null, unexpected error, closing socket connection and attempting 
> reconnect (org.apache.zookeeper.ClientCnxn)
> 8/30/2018 4:50:26 PM java.net.NoRouteToHostException: No route to host
> 8/30/2018 4:50:26 PM at sun.nio.ch.SocketChannelImpl.checkConnect(Native 
> Method)
> 8/30/2018 4:50:26 PM at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> 8/30/2018 4:50:26 PM at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
> 8/30/2018 4:50:26 PM at 
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7789) SSL-related unit tests hang when run on Fedora 29

2019-01-07 Thread Tom Bentley (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735729#comment-16735729
 ] 

Tom Bentley commented on KAFKA-7789:


This is caused by Fedora tightening up its system-wide crypto policies, as 
described here: https://fedoraproject.org/wiki/Changes/StrongCryptoSettings2. 
Their changes to {{/etc/crypto-policies/back-ends/java.config}} set 
{{jdk.certpath.disabledAlgorithms=MD2, MD5, DSA, RSA keySize < 2048}}, thus 
causing the KeyManager to reject RSA keys smaller than 2048 bits. The keys are 
rejected silently unless the {{-Djavax.net.debug=ssl,handshake,keymanager}} 
system property is set. {{TestSslUtils}} generates its test keys as 1024-bit 
RSA keys.

Fedora 29 users can change the policy to LEGACY with {{update-crypto-policies 
--set LEGACY}} as root, but this enables LEGACY algorithm support system-wide. 
The better option is to update the unit tests to use 2048-bit keys.
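
For illustration, a sketch of generating a policy-compliant RSA key pair with 
plain JCA (assuming the fix simply raises the key size, as the PR title 
suggests):

{code:java}
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.NoSuchAlgorithmException;

// Illustrative only: an RSA key pair large enough to satisfy Fedora 29's
// default crypto policy (RSA keySize >= 2048).
class TestKeyGeneration {
    static KeyPair generateRsaKeyPair() throws NoSuchAlgorithmException {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
        generator.initialize(2048); // 1024-bit keys are rejected by jdk.certpath.disabledAlgorithms
        return generator.generateKeyPair();
    }
}
{code}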

> SSL-related unit tests hang when run on Fedora 29
> -
>
> Key: KAFKA-7789
> URL: https://issues.apache.org/jira/browse/KAFKA-7789
> Project: Kafka
>  Issue Type: Bug
>Reporter: Tom Bentley
>Assignee: Tom Bentley
>Priority: Minor
>
> Various SSL-related unit tests (such as {{SslSelectorTest}}) hang when 
> executed on Fedora 29. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7789) SSL-related unit tests hang when run on Fedora 29

2019-01-07 Thread Tom Bentley (JIRA)
Tom Bentley created KAFKA-7789:
--

 Summary: SSL-related unit tests hang when run on Fedora 29
 Key: KAFKA-7789
 URL: https://issues.apache.org/jira/browse/KAFKA-7789
 Project: Kafka
  Issue Type: Bug
Reporter: Tom Bentley
Assignee: Tom Bentley


Various SSL-related unit tests (such as {{SslSelectorTest}}) hang when executed 
on Fedora 29. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7757) Too many open files after java.io.IOException: Connection to n was disconnected before the response was read

2019-01-07 Thread Rajini Sivaram (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735600#comment-16735600
 ] 

Rajini Sivaram commented on KAFKA-7757:
---

This could be due to KAFKA-7697. Can you get a thread dump of a broker while the 
file descriptor count is climbing and attach it to the JIRA? Thank you!
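
For anyone trying to catch that moment, here is a sketch of polling the 
broker's open-file-descriptor count over JMX (illustrative only, not part of 
Kafka; it assumes the broker was started with {{JMX_PORT=9999}}, and 
{{broker-host}} is a placeholder):

{code:java}
import java.lang.management.ManagementFactory;
import javax.management.JMX;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import com.sun.management.UnixOperatingSystemMXBean;

// Polls the broker JVM's open FD count so you know when to take the thread dump.
public class FdWatcher {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            UnixOperatingSystemMXBean os = JMX.newMXBeanProxy(connection,
                    new ObjectName(ManagementFactory.OPERATING_SYSTEM_MXBEAN_NAME),
                    UnixOperatingSystemMXBean.class);
            while (true) {
                System.out.printf("open fds: %d / max: %d%n",
                        os.getOpenFileDescriptorCount(),
                        os.getMaxFileDescriptorCount());
                Thread.sleep(10_000L); // take a jstack of the broker once this climbs
            }
        }
    }
}
{code}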

> Too many open files after java.io.IOException: Connection to n was 
> disconnected before the response was read
> 
>
> Key: KAFKA-7757
> URL: https://issues.apache.org/jira/browse/KAFKA-7757
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.0
>Reporter: Pedro Gontijo
>Priority: Major
> Attachments: Screen Shot 2019-01-03 at 12.20.38 PM.png, 
> kafka-allocated-file-handles.png, server.properties, td1.txt, td2.txt, td3.txt
>
>
> We upgraded from 0.10.2.2 to 2.1.0 (a cluster with 3 brokers).
> After a while (hours), 2 brokers start to throw:
> {code:java}
> java.io.IOException: Connection to NN was disconnected before the response 
> was read
> at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
> at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
> at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
> at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {code}
> File descriptors start to pile up, and if I do not restart the broker it 
> throws "Too many open files" and crashes.
> {code:java}
> ERROR Error while accepting connection (kafka.network.Acceptor)
> java.io.IOException: Too many open files in system
> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> at kafka.network.Acceptor.accept(SocketServer.scala:460)
> at kafka.network.Acceptor.run(SocketServer.scala:403)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  After some hours the issue happens again... It has happened with all 
> brokers, so it is not something specific to an instance.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)