[jira] [Created] (KAFKA-5478) TransactionsTest.testFencingOnAddPartitions Timeout

2017-06-19 Thread Jason Gustafson (JIRA)
Jason Gustafson created KAFKA-5478:
--

 Summary: TransactionsTest.testFencingOnAddPartitions Timeout
 Key: KAFKA-5478
 URL: https://issues.apache.org/jira/browse/KAFKA-5478
 Project: Kafka
  Issue Type: Sub-task
Reporter: Jason Gustafson
Assignee: Jason Gustafson


Spotted here: 
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/5468/console.
{code}
kafka.api.TransactionsTest > testFencingOnAddPartitions STARTED
Build timed out (after 180 minutes). Marking the build as aborted.
Build was aborted
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5472) Connector validate REST endpoint returning duplicate entries in "groups"

2017-06-19 Thread Randall Hauch (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Hauch updated KAFKA-5472:
-
Fix Version/s: 0.11.0.0

> Connector validate REST endpoint returning duplicate entries in "groups"
> 
>
> Key: KAFKA-5472
> URL: https://issues.apache.org/jira/browse/KAFKA-5472
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 0.11.0.0
>Reporter: Randall Hauch
>Assignee: Randall Hauch
>Priority: Blocker
> Fix For: 0.11.0.0
>
>
> When calling the REST endpoint to validate a connector configuration, the 
> response includes duplicate values for the {{groups}} field:
> {code:javascript}
> {
>   // ...
>   groups: ["Common", "Transforms", "Common", "Transforms"]
> }
> {code}
> This is a regression and may cause issues for any tooling that is using the 
> Kafka Connect REST API.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5472) Connector validate REST endpoint returning duplicate entries in "groups"

2017-06-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16055019#comment-16055019
 ] 

ASF GitHub Bot commented on KAFKA-5472:
---

GitHub user rhauch opened a pull request:

https://github.com/apache/kafka/pull/3379

KAFKA-5472 Eliminated duplicate group names when validating connector 
results

Kafka Connect was adding duplicate group names in the response from the 
REST API's validation of connector configurations. This fixes the duplicates 
and maintains the order of the `ConfigDef` objects so that the `ConfigValue` 
results are in the same order.

This is a blocker and should be merged to 0.11.0.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rhauch/kafka KAFKA-5472

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3379.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3379


commit d8e6168584cab68ff0844b6d18d8cdb79792427b
Author: Randall Hauch 
Date:   2017-06-20T00:52:28Z

KAFKA-5472 Eliminated duplicate group names when validating connector 
results




> Connector validate REST endpoint returning duplicate entries in "groups"
> 
>
> Key: KAFKA-5472
> URL: https://issues.apache.org/jira/browse/KAFKA-5472
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 0.11.0.0
>Reporter: Randall Hauch
>Assignee: Randall Hauch
>Priority: Blocker
>
> When calling the REST endpoint to validate a connector configuration, the 
> response includes duplicate values for the {{groups}} field:
> {code:javascript}
> {
>   // ...
>   groups: ["Common", "Transforms", "Common", "Transforms"]
> }
> {code}
> This is a regression and may cause issues for any tooling that is using the 
> Kafka Connect REST API.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5477) TransactionalProducer sleeps unnecessarily long during back to back transactions

2017-06-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054989#comment-16054989
 ] 

ASF GitHub Bot commented on KAFKA-5477:
---

GitHub user apurvam opened a pull request:

https://github.com/apache/kafka/pull/3377

KAFKA-5477: Lower retryBackoff for AddPartitionsRequest

This patch lowers the retry backoff when receiving a 
CONCURRENT_TRANSACTIONS error from an AddPartitions request. The default of 
100ms would mean that back to back transactions would be 100ms long at minimum, 
making things too slow.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apurvam/kafka 
HOTFIX-lower-retry-for-add-partitions

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3377.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3377


commit 0d676688e7ed9a8d63189eb704143e62752707cc
Author: Apurva Mehta 
Date:   2017-06-20T00:36:28Z

Lower retryBackoff when receiving a CONCURRENT_TRANSACTIONS error from an 
AddPartitions request




> TransactionalProducer sleeps unnecessarily long during back to back 
> transactions
> 
>
> Key: KAFKA-5477
> URL: https://issues.apache.org/jira/browse/KAFKA-5477
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>
> I am running some perf tests for EOS and there is a severe perf impact with 
> our default configs. 
> Here is the issue.
> # When we do a commit transaction, the producer sends an `EndTxn` request to 
> the coordinator. The coordinator writes the `PrepareCommit` message to the 
> transaction log and then returns the response to the client. It writes the 
> transaction markers and the final 'CompleteCommit' message asynchronously. 
> # In the mean time, if the client starts another transaction, it will send an 
> `AddPartitions` request on the next `Sender.run` loop. If the markers haven't 
> been written yet, then the coordinator will return a retriable 
> `CONCURRENT_TRANSACTIONS` error to the client.
> # The current behavior in the producer is to sleep for `retryBackoffMs` 
> before retrying the request. The current default for this is 100ms. So the 
> producer will sleep for 100ms before sending the `AddPartitions` again. This 
> puts a floor on the latency for back to back transactions.
> The impact: Back to back transactions (the typical usecase for streams) would 
> have a latency floor of 100ms.
> Ideally, we don't want to sleep the full 100ms  in this particular case, 
> because the retry is 'expected'.
> The options are: 
> # do nothing, let streams override the retry.backoff.ms in their producer to 
> 10 when EOS is enabled (since they have a HOTFIX patch out anyway).
> # Introduce a special 'transactionRetryBackoffMs' non-configurable variable 
> and hard code that to a low value which applies to all transactional requests.
> # do nothing and fix it properly in 0.11.0.1 
> Option 2 as stated is a 1 line fix. If we want to lower the retry just for 
> this particular error, it would be a slightly bigger change (10-15 lines).
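
As a rough illustration of option 1 above (not the actual Streams hotfix), this is 
what overriding {{retry.backoff.ms}} looks like on a plain transactional producer; 
the broker address, transactional id, and topic below are placeholders:

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LowBackoffTransactionalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "example-txn-id"); // placeholder
        // Option 1: lower retry.backoff.ms (default 100ms) so the AddPartitions retry
        // after an expected CONCURRENT_TRANSACTIONS error does not impose a ~100ms
        // latency floor on back to back transactions.
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "10");

        Producer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        producer.beginTransaction();
        producer.send(new ProducerRecord<>("example-topic", "key", "value"));
        producer.commitTransaction();
        producer.close();
    }
}
{code}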



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5477) TransactionalProducer sleeps unnecessarily long during back to back transactions

2017-06-19 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5477:

Description: 
I am running some perf tests for EOS and there is a severe perf impact with our 
default configs. 

Here is the issue.

# When we do a commit transaction, the producer sends an `EndTxn` request to 
the coordinator. The coordinator writes the `PrepareCommit` message to the 
transaction log and then returns the response to the client. It writes the 
transaction markers and the final 'CompleteCommit' message asynchronously. 
# In the mean time, if the client starts another transaction, it will send an 
`AddPartitions` request on the next `Sender.run` loop. If the markers haven't 
been written yet, then the coordinator will return a retriable 
`CONCURRENT_TRANSACTIONS` error to the client.
# The current behavior in the producer is to sleep for `retryBackoffMs` before 
retrying the request. The current default for this is 100ms. So the producer 
will sleep for 100ms before sending the `AddPartitions` again. This puts a 
floor on the latency for back to back transactions.

The impact: Back to back transactions (the typical usecase for streams) would 
have a latency floor of 100ms.

Ideally, we don't want to sleep the full 100ms  in this particular case, 
because the retry is 'expected'.

The options are: 
# do nothing, let streams override the retry.backoff.ms in their producer to 10 
when EOS is enabled (since they have a HOTFIX patch out anyway).
# Introduce a special 'transactionRetryBackoffMs' non-configurable variable and 
hard code that to a low value which applies to all transactional requests.
# do nothing and fix it properly in 0.11.0.1 

Option 2 as stated is a 1 line fix. If we want to lower the retry just for this 
particular error, it would be a slightly bigger change (10-15 lines).

  was:
I am running some perf tests for EOS and there is a severe perf impact with our 
default configs. 

Here is the issue.

# When we do a commit transaction, the producer sends an `EndTxn` request to 
the coordinator. The coordinator writes the `PrepareCommit` message to the 
transaction log and then returns the response to the client. It writes the 
transaction markers and the final 'CompleteCommit' message asynchronously. 
# In the mean time, if the client starts another transaction, it will send an 
`AddPartitions` request on the next `Sender.run` loop. If the markers haven't 
been written yet, then the coordinator will return a retriable 
`CONCURRENT_TRANSACTIONS` error to the client.
# The current behavior in the producer is to sleep for `retryBackoffMs` before 
retrying the request. The current default for this is 100ms. So the producer 
will sleep for 100ms before sending the `AddPartitions` again. This puts a 
floor on the latency for back to back transactions.

Ideally, we don't want to sleep the full 100ms  in this particular case, 
because the retry is 'expected'.

The options are: 
# do nothing, let streams override the retry.backoff.ms in their producer to 10 
when EOS is enabled (since they have a HOTFIX patch out anyway).
# Introduce a special 'transactionRetryBackoffMs' non-configurable variable and 
hard code that to a low value which applies to all transactional requests.
# do nothing and fix it properly in 0.11.0.1 


> TransactionalProducer sleeps unnecessarily long during back to back 
> transactions
> 
>
> Key: KAFKA-5477
> URL: https://issues.apache.org/jira/browse/KAFKA-5477
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>
> I am running some perf tests for EOS and there is a severe perf impact with 
> our default configs. 
> Here is the issue.
> # When we do a commit transaction, the producer sends an `EndTxn` request to 
> the coordinator. The coordinator writes the `PrepareCommit` message to the 
> transaction log and then returns the response to the client. It writes the 
> transaction markers and the final 'CompleteCommit' message asynchronously. 
> # In the mean time, if the client starts another transaction, it will send an 
> `AddPartitions` request on the next `Sender.run` loop. If the markers haven't 
> been written yet, then the coordinator will return a retriable 
> `CONCURRENT_TRANSACTIONS` error to the client.
> # The current behavior in the producer is to sleep for `retryBackoffMs` 
> before retrying the request. The current default for this is 100ms. So the 
> producer will sleep for 100ms before sending the `AddPartitions` again. This 
> puts a floor on the latency for back to back transactions.
> The impact: Back to back transactions (the typical usecase for streams) would 
> have a latency floor of 100ms.
> Ideally, we don't want to sleep the full 100ms 

[jira] [Updated] (KAFKA-5477) TransactionalProducer sleeps unnecessarily long during back to back transactions

2017-06-19 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5477:

Description: 
I am running some perf tests for EOS and there is a severe perf impact with our 
default configs. 

Here is the issue.

# When we do a commit transaction, the producer sends an `EndTxn` request to 
the coordinator. The coordinator writes the `PrepareCommit` message to the 
transaction log and then returns the response to the client. It writes the 
transaction markers and the final 'CompleteCommit' message asynchronously. 
# In the mean time, if the client starts another transaction, it will send an 
`AddPartitions` request on the next `Sender.run` loop. If the markers haven't 
been written yet, then the coordinator will return a retriable 
`CONCURRENT_TRANSACTIONS` error to the client.
# The current behavior in the producer is to sleep for `retryBackoffMs` before 
retrying the request. The current default for this is 100ms. So the producer 
will sleep for 100ms before sending the `AddPartitions` again. This puts a 
floor on the latency for back to back transactions.

Ideally, we don't want to sleep the full 100ms  in this particular case, 
because the retry is 'expected'.

The options are: 
# do nothing, let streams override the retry.backoff.ms in their producer to 10 
when EOS is enabled (since they have a HOTFIX patch out anyway).
# Introduce a special 'transactionRetryBackoffMs' non-configurable variable and 
hard code that to a low value which applies to all transactional requests.
# do nothing and fix it properly in 0.11.0.1 

  was:
I am running some perf tests for EOS and there is a severe perf impact with our 
default configs. 

Here is the issue.

# When we do a commit transaction, the producer sends an `EndTxn` request to 
the coordinator. The coordinator writes the `PrepareCommit` message to the 
transaction log and then returns the response to the client. It writes the 
transaction markers and the final 'CompleteCommit' message asynchronously. 
# In the mean time, if the client starts another transaction, it will send an 
`AddPartitions` request on the next `Sender.run` loop. If the markers haven't 
been written yet, then the coordinator will return a retriable 
`CONCURRENT_TRANSACTIONS` error to the client.
# The current behavior in the producer is to sleep for `retryBackoffMs` before 
retrying the request. The current default for this is 100ms. So the producer 
will sleep for 100ms before sending the `AddPartitions` again. This puts a 
floor on the latency for transactions.

Ideally, we don't want to sleep the full 100ms  in this particular case, 
because the retry is 'expected'.

The options are: 
# do nothing, let streams override the retry.backoff.ms in their producer to 10 
when EOS is enabled (since they have a HOTFIX patch out anyway).
# Introduce a special 'transactionRetryBackoffMs' non-configurable variable and 
hard code that to a low value which applies to all transactional requests.
# do nothing and fix it properly in 0.11.0.1 


> TransactionalProducer sleeps unnecessarily long during back to back 
> transactions
> 
>
> Key: KAFKA-5477
> URL: https://issues.apache.org/jira/browse/KAFKA-5477
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>
> I am running some perf tests for EOS and there is a severe perf impact with 
> our default configs. 
> Here is the issue.
> # When we do a commit transaction, the producer sends an `EndTxn` request to 
> the coordinator. The coordinator writes the `PrepareCommit` message to the 
> transaction log and then returns the response to the client. It writes the 
> transaction markers and the final 'CompleteCommit' message asynchronously. 
> # In the mean time, if the client starts another transaction, it will send an 
> `AddPartitions` request on the next `Sender.run` loop. If the markers haven't 
> been written yet, then the coordinator will return a retriable 
> `CONCURRENT_TRANSACTIONS` error to the client.
> # The current behavior in the producer is to sleep for `retryBackoffMs` 
> before retrying the request. The current default for this is 100ms. So the 
> producer will sleep for 100ms before sending the `AddPartitions` again. This 
> puts a floor on the latency for back to back transactions.
> Ideally, we don't want to sleep the full 100ms  in this particular case, 
> because the retry is 'expected'.
> The options are: 
> # do nothing, let streams override the retry.backoff.ms in their producer to 
> 10 when EOS is enabled (since they have a HOTFIX patch out anyway).
> # Introduce a special 'transactionRetryBackoffMs' non-configurable variable 
> and hard code that to a low value which applies to all transactional requests

[jira] [Created] (KAFKA-5477) TransactionalProducer sleeps unnecessarily long during back to back transactions

2017-06-19 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5477:
---

 Summary: TransactionalProducer sleeps unnecessarily long during 
back to back transactions
 Key: KAFKA-5477
 URL: https://issues.apache.org/jira/browse/KAFKA-5477
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.11.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta


I am running some perf tests for EOS and there is a severe perf impact with our 
default configs. 

Here is the issue.

# When we do a commit transaction, the producer sends an `EndTxn` request to 
the coordinator. The coordinator writes the `PrepareCommit` message to the 
transaction log and then returns the response to the client. It writes the 
transaction markers and the final 'CompleteCommit' message asynchronously. 
# In the mean time, if the client starts another transaction, it will send an 
`AddPartitions` request on the next `Sender.run` loop. If the markers haven't 
been written yet, then the coordinator will return a retriable 
`CONCURRENT_TRANSACTIONS` error to the client.
# The current behavior in the producer is to sleep for `retryBackoffMs` before 
retrying the request. The current default for this is 100ms. So the producer 
will sleep for 100ms before sending the `AddPartitions` again. This puts a 
floor on the latency for transactions.

Ideally, we don't want to sleep the full 100ms  in this particular case, 
because the retry is 'expected'.

The options are: 
# do nothing, let streams override the retry.backoff.ms in their producer to 10 
when EOS is enabled (since they have a HOTFIX patch out anyway).
# Introduce a special 'transactionRetryBackoffMs' non-configurable variable and 
hard code that to a low value which applies to all transactional requests.
# do nothing and fix it properly in 0.11.0.1 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-5476) Implement a system test that creates network partitions

2017-06-19 Thread Colin P. McCabe (JIRA)
Colin P. McCabe created KAFKA-5476:
--

 Summary: Implement a system test that creates network partitions
 Key: KAFKA-5476
 URL: https://issues.apache.org/jira/browse/KAFKA-5476
 Project: Kafka
  Issue Type: Test
Reporter: Colin P. McCabe
Assignee: Colin P. McCabe


Implement a system test that creates network partitions



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5473) handle ZK session expiration properly when a new session can't be established

2017-06-19 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054960#comment-16054960
 ] 

Jun Rao commented on KAFKA-5473:


Yes, the issue has been there since KAFKA-2405 was committed. So, it's affecting 
all versions since 0.9.0.0.

> handle ZK session expiration properly when a new session can't be established
> -
>
> Key: KAFKA-5473
> URL: https://issues.apache.org/jira/browse/KAFKA-5473
> Project: Kafka
>  Issue Type: Sub-task
>Affects Versions: 0.9.0.0
>Reporter: Jun Rao
>
> In https://issues.apache.org/jira/browse/KAFKA-2405, we changed the logic in 
> handling ZK session expiration a bit. If a new ZK session can't be 
> established after session expiration, we just log an error and continue. 
> However, this can leave the broker in a bad state since it's up, but not 
> registered from the controller's perspective. Replicas on this broker may 
> never be in sync.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5465) FetchResponse v0 does not return any messages when max_bytes smaller than v2 message set

2017-06-19 Thread Jeff Widman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054956#comment-16054956
 ] 

Jeff Widman commented on KAFKA-5465:


I agree with the "WontFix" resolution; however, I'm scared that I'll forget 
about this when upgrading our cluster. Can a link to this issue be added to the 
Upgrade instructions for 0.11?

> FetchResponse v0 does not return any messages when max_bytes smaller than v2 
> message set 
> -
>
> Key: KAFKA-5465
> URL: https://issues.apache.org/jira/browse/KAFKA-5465
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Dana Powers
>Assignee: Jason Gustafson
>Priority: Blocker
> Fix For: 0.11.0.1
>
>
> In prior releases, when consuming uncompressed messages, FetchResponse v0 
> will return a message if it is smaller than the max_bytes sent in the 
> FetchRequest. In 0.11.0.0 RC0, when messages are stored as v2 internally, the 
> response will be empty unless the full MessageSet is smaller than max_bytes. 
> In some configurations, this may cause old consumers to get stuck on large 
> messages where previously they were able to make progress one message at a 
> time.
> For example, when I produce 10 5KB messages using ProduceRequest v0 and then 
> attempt FetchRequest v0 with partition max bytes = 6KB (larger than a single 
> message but smaller than all 10 messages together), I get an empty message 
> set from 0.11.0.0. Previous brokers would have returned a single message.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5473) handle ZK session expiration properly when a new session can't be established

2017-06-19 Thread Jeff Widman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054955#comment-16054955
 ] 

Jeff Widman commented on KAFKA-5473:


Saw this was tagged as 0.9.0... Does this affect any of the 0.10 or 0.11 
releases?

> handle ZK session expiration properly when a new session can't be established
> -
>
> Key: KAFKA-5473
> URL: https://issues.apache.org/jira/browse/KAFKA-5473
> Project: Kafka
>  Issue Type: Sub-task
>Affects Versions: 0.9.0.0
>Reporter: Jun Rao
>
> In https://issues.apache.org/jira/browse/KAFKA-2405, we changed the logic in 
> handling ZK session expiration a bit. If a new ZK session can't be 
> established after session expiration, we just log an error and continue. 
> However, this can leave the broker in a bad state since it's up, but not 
> registered from the controller's perspective. Replicas on this broker may 
> never be in sync.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4201) Add an --assignment-strategy option to new-consumer-based Mirror Maker

2017-06-19 Thread Vahid Hashemian (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054915#comment-16054915
 ] 

Vahid Hashemian commented on KAFKA-4201:


[~johnma] It seems to me that the focus of 
[KAFKA-2111|https://issues.apache.org/jira/browse/KAFKA-2111] is on existing 
arguments of Kafka tools. [This 
JIRA|https://issues.apache.org/jira/browse/KAFKA-4201] however introduces a new 
argument. I would suggest handling existing and approved arguments in 
[KAFKA-2111|https://issues.apache.org/jira/browse/KAFKA-2111]. If necessary, 
the [existing PR for this JIRA|https://github.com/apache/kafka/pull/1912] can 
later be rebased to comply with any standardization introduced in 
[KAFKA-2111|https://issues.apache.org/jira/browse/KAFKA-2111]. I hope it makes 
sense. Please advise if you disagree. Thanks. 

> Add an --assignment-strategy option to new-consumer-based Mirror Maker
> --
>
> Key: KAFKA-4201
> URL: https://issues.apache.org/jira/browse/KAFKA-4201
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Vahid Hashemian
>Assignee: Vahid Hashemian
> Fix For: 0.11.1.0
>
>
> The default assignment strategy in mirror maker will be changed from range to 
> round robin in an upcoming release ([see 
> KAFKA-3818|https://issues.apache.org/jira/browse/KAFKA-3818]). In order to 
> make it easier for users to change the assignment strategy, add an 
> {{--assignment-strategy}} option to the Mirror Maker command line tool.
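
For context, the consumer-level setting such an option would surface is 
{{partition.assignment.strategy}}; a minimal sketch with the new Java consumer 
(broker address, group id, and topic are placeholders):

{code:java}
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.RoundRobinAssignor;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RoundRobinConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mirror-maker-example");     // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // The assignment strategy an --assignment-strategy flag would map to:
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, RoundRobinAssignor.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic")); // placeholder topic
            consumer.poll(1000);
        }
    }
}
{code}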



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-5475) Connector config validation REST API endpoint not including fields for transformations

2017-06-19 Thread Ewen Cheslack-Postava (JIRA)
Ewen Cheslack-Postava created KAFKA-5475:


 Summary: Connector config validation REST API endpoint not 
including fields for transformations
 Key: KAFKA-5475
 URL: https://issues.apache.org/jira/browse/KAFKA-5475
 Project: Kafka
  Issue Type: Bug
  Components: KafkaConnect
Reporter: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.11.0.0


An issue with how embedded transformation configurations are included seems to 
have been introduced during 0.11.0.0. We are no longer seeing the 
`transforms..type` being included in the validation output.

{code}
 curl -X PUT -H "Content-Type: application/json" --data '{"connector.class": 
"org.apache.kafka.connect.file.FileStreamSourceConnector", "transforms": 
"foo,bar"}' 
http://localhost:8083/connector-plugins/org.apache.kafka.connect.file.FileStreamSourceConnector/config/validate
  | jq
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100  3428  100  3325  100   103   344k  10917 --:--:-- --:--:-- --:--:--  360k
{
  "name": "org.apache.kafka.connect.file.FileStreamSourceConnector",
  "error_count": 1,
  "groups": [
"Common",
"Transforms"
  ],
  "configs": [
{
  "definition": {
"name": "value.converter",
"type": "CLASS",
"required": false,
"default_value": null,
"importance": "LOW",
"documentation": "Converter class used to convert between Kafka Connect 
format and the serialized form that is written to Kafka. This controls the 
format of the values in messages written to or read from Kafka, and since this 
is independent of connectors it allows any connector to work with any 
serialization format. Examples of common formats include JSON and Avro.",
"group": "Common",
"width": "SHORT",
"display_name": "Value converter class",
"dependents": [],
"order": 5
  },
  "value": {
"name": "value.converter",
"value": null,
"recommended_values": [],
"errors": [],
"visible": true
  }
},
{
  "definition": {
"name": "name",
"type": "STRING",
"required": true,
"default_value": null,
"importance": "HIGH",
"documentation": "Globally unique name to use for this connector.",
"group": "Common",
"width": "MEDIUM",
"display_name": "Connector name",
"dependents": [],
"order": 1
  },
  "value": {
"name": "name",
"value": null,
"recommended_values": [],
"errors": [
  "Missing required configuration \"name\" which has no default value."
],
"visible": true
  }
},
{
  "definition": {
"name": "tasks.max",
"type": "INT",
"required": false,
"default_value": "1",
"importance": "HIGH",
"documentation": "Maximum number of tasks to use for this connector.",
"group": "Common",
"width": "SHORT",
"display_name": "Tasks max",
"dependents": [],
"order": 3
  },
  "value": {
"name": "tasks.max",
"value": "1",
"recommended_values": [],
"errors": [],
"visible": true
  }
},
{
  "definition": {
"name": "connector.class",
"type": "STRING",
"required": true,
"default_value": null,
"importance": "HIGH",
"documentation": "Name or alias of the class for this connector. Must 
be a subclass of org.apache.kafka.connect.connector.Connector. If the connector 
is org.apache.kafka.connect.file.FileStreamSinkConnector, you can either 
specify this full name,  or use \"FileStreamSink\" or 
\"FileStreamSinkConnector\" to make the configuration a bit shorter",
"group": "Common",
"width": "LONG",
"display_name": "Connector class",
"dependents": [],
"order": 2
  },
  "value": {
"name": "connector.class",
"value": "org.apache.kafka.connect.file.FileStreamSourceConnector",
"recommended_values": [],
"errors": [],
"visible": true
  }
},
{
  "definition": {
"name": "key.converter",
"type": "CLASS",
"required": false,
"default_value": null,
"importance": "LOW",
"documentation": "Converter class used to convert between Kafka Connect 
format and the serialized form that is written to Kafka. This controls the 
format of the keys in messages written to or read from Kafka, and since this is 
independent of connectors it allows any connector to work with any 
serialization format. Examples of common formats include JSON and Avro.",
"group": "Common",
"width": "SHORT",
"display_name

[jira] [Commented] (KAFKA-4437) Incremental Batch Processing for Kafka Streams

2017-06-19 Thread Matthias J. Sax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054906#comment-16054906
 ] 

Matthias J. Sax commented on KAFKA-4437:


Thanks [~avi_parkassist]. Updated :)

> Incremental Batch Processing for Kafka Streams
> --
>
> Key: KAFKA-4437
> URL: https://issues.apache.org/jira/browse/KAFKA-4437
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>
> We want to add an “auto stop” feature that terminates a stream application 
> when it has processed all the data that was newly available at the time the 
> application started (i.e., up to the then-current end-of-log / high watermark). 
> This allows chopping the (infinite) log into finite chunks, where each run of 
> the application processes one chunk. This feature allows for incremental 
> batch-like processing; think "start-process-stop-restart-process-stop-..."
> For details see KIP-95: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-95%3A+Incremental+Batch+Processing+for+Kafka+Streams



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4857) Use AdminClient in Kafka Streams

2017-06-19 Thread Matthias J. Sax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054904#comment-16054904
 ] 

Matthias J. Sax commented on KAFKA-4857:


Doing the KIP is part of the JIRA, so it's your call to do it :) Sorry if this 
was not clear. Please have a look at 
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals 
Let us know if you need any help.

However, as we have not done the switch to {{AdminClient}} yet, I would recommend 
deferring this JIRA until that internal change is done. Otherwise, we would need 
to update {{KafkaClientSupplier}} twice.

> Use AdminClient in Kafka Streams
> 
>
> Key: KAFKA-4857
> URL: https://issues.apache.org/jira/browse/KAFKA-4857
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Sharad
>  Labels: needs-kip
>
> Streams uses {{KafkaClientSupplier}} to get 
> consumer/restore-consumer/producer clients. Streams also uses one more client 
> for admin purposes, namely {{StreamsKafkaClient}}, which is instantiated 
> "manually".
> With the newly upcoming {{AdminClient}} from KIP-117, we can simplify (or 
> even replace) {{StreamsKafkaClient}} with the new {{AdminClient}}. We 
> furthermore want to unify how this client is generated and extend 
> {{KafkaClientSupplier}} with a method that returns this client.
> As this is a public API change, a KIP is required.
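
A purely hypothetical sketch of what such an extension might look like (the 
interface name and method signature below are assumptions, not the eventual KIP):

{code:java}
import java.util.Map;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.streams.KafkaClientSupplier;

// Hypothetical: KafkaClientSupplier today only supplies producer, consumer,
// and restore-consumer clients; the idea is to let it also supply the admin
// client so Streams no longer instantiates StreamsKafkaClient "manually".
public interface AdminAwareClientSupplier extends KafkaClientSupplier {
    AdminClient getAdminClient(final Map<String, Object> config);
}
{code}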



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4201) Add an --assignment-strategy option to new-consumer-based Mirror Maker

2017-06-19 Thread Mariam John (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054902#comment-16054902
 ] 

Mariam John commented on KAFKA-4201:


Vahid,

  Can I assign this to myself since I am working on KAFKA-2111 which is related 
to this? Thanks.

> Add an --assignment-strategy option to new-consumer-based Mirror Maker
> --
>
> Key: KAFKA-4201
> URL: https://issues.apache.org/jira/browse/KAFKA-4201
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Vahid Hashemian
>Assignee: Vahid Hashemian
> Fix For: 0.11.1.0
>
>
> The default assignment strategy in mirror maker will be changed from range to 
> round robin in an upcoming release ([see 
> KAFKA-3818|https://issues.apache.org/jira/browse/KAFKA-3818]). In order to 
> make it easier for users to change the assignment strategy, add an 
> {{--assignment-strategy}} option to the Mirror Maker command line tool.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4437) Incremental Batch Processing for Kafka Streams

2017-06-19 Thread Avi Flax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054888#comment-16054888
 ] 

Avi Flax commented on KAFKA-4437:
-

The KIP page says:

bq. Released: 0.10.2.0

Seems like it should be updated.

> Incremental Batch Processing for Kafka Streams
> --
>
> Key: KAFKA-4437
> URL: https://issues.apache.org/jira/browse/KAFKA-4437
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>
> We want to add an “auto stop” feature that terminates a stream application 
> when it has processed all the data that was newly available at the time the 
> application started (i.e., up to the then-current end-of-log / high watermark). 
> This allows chopping the (infinite) log into finite chunks, where each run of 
> the application processes one chunk. This feature allows for incremental 
> batch-like processing; think "start-process-stop-restart-process-stop-..."
> For details see KIP-95: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-95%3A+Incremental+Batch+Processing+for+Kafka+Streams



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5474) Streams StandbyTask should no checkpoint on commit if EOS is enabled

2017-06-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054804#comment-16054804
 ] 

ASF GitHub Bot commented on KAFKA-5474:
---

GitHub user mjsax opened a pull request:

https://github.com/apache/kafka/pull/3375

KAFKA-5474: Streams StandbyTask should no checkpoint on commit if EOS is 
enabled

 - actual fix for `StandbyTask#commit()`
Additionally (for debugging):
 - EOS test, does not report "expected" value correctly
 - `StreamThread` does not report "standby tasks" correctly
 - add `IntegerDecoder` (to be used with `kafka.tools.DumpLogSegments`)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mjsax/kafka kafka-5474-eos-standby-task

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3375.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3375


commit 89ea5aa85659177907b6ab14abce1a37cb24f21a
Author: Matthias J. Sax 
Date:   2017-06-19T21:47:48Z

KAFKA-5474: Streams StandbyTask should no checkpoint on commit if EOS is 
enabled

commit 2b6349d61c2959c671d3f5157fbc4aa40c0f47d1
Author: Matthias J. Sax 
Date:   2017-06-19T21:49:28Z

Debug, cleanup, and minor fixes




> Streams StandbyTask should no checkpoint on commit if EOS is enabled
> 
>
> Key: KAFKA-5474
> URL: https://issues.apache.org/jira/browse/KAFKA-5474
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.11.0.0
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> Discovered by system test {{streams_eos_test#test_failure_and_recovery}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-5474) Streams StandbyTask should no checkpoint on commit if EOS is enabled

2017-06-19 Thread Matthias J. Sax (JIRA)
Matthias J. Sax created KAFKA-5474:
--

 Summary: Streams StandbyTask should no checkpoint on commit if EOS 
is enabled
 Key: KAFKA-5474
 URL: https://issues.apache.org/jira/browse/KAFKA-5474
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 0.11.0.0
Reporter: Matthias J. Sax
Assignee: Matthias J. Sax
Priority: Blocker
 Fix For: 0.11.0.0


Discovered by system test {{streams_eos_test#test_failure_and_recovery}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5473) handle ZK session expiration properly when a new session can't be established

2017-06-19 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054781#comment-16054781
 ] 

Jun Rao commented on KAFKA-5473:


One thing to note is that the construction of the ZooKeeper object can throw an 
exception.

org.I0Itec.zkclient.exception.ZkException: Unable to connect to 
zookeeper.host.xxx.xxx:2181
at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:72)
at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1289)
at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:721)
at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:632)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
Caused by: java.net.UnknownHostException: zookeeper.host.xxx.xxx
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at 
org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:70)

> handle ZK session expiration properly when a new session can't be established
> -
>
> Key: KAFKA-5473
> URL: https://issues.apache.org/jira/browse/KAFKA-5473
> Project: Kafka
>  Issue Type: Sub-task
>Affects Versions: 0.9.0.0
>Reporter: Jun Rao
>
> In https://issues.apache.org/jira/browse/KAFKA-2405, we changed the logic in 
> handling ZK session expiration a bit. If a new ZK session can't be 
> established after session expiration, we just log an error and continue. 
> However, this can leave the broker in a bad state since it's up, but not 
> registered from the controller's perspective. Replicas on this broker may 
> never be in sync.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2405) KafkaHealthCheck kills the JVM in handleSessionEstablishmentError

2017-06-19 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054764#comment-16054764
 ] 

Jun Rao commented on KAFKA-2405:


Just realized this change. This seems to cause the ZK session expiration 
handling to be worse than before. Filed 
https://issues.apache.org/jira/browse/KAFKA-5473 to address this.

> KafkaHealthCheck kills the JVM in handleSessionEstablishmentError
> -
>
> Key: KAFKA-2405
> URL: https://issues.apache.org/jira/browse/KAFKA-2405
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Reporter: jaikiran pai
>Assignee: jaikiran pai
> Fix For: 0.9.0.0
>
>
> The current code in KafkaHealthCheck in trunk does this:
> {code}
> override def handleSessionEstablishmentError(error: Throwable): Unit = {
>   fatal("Could not establish session with zookeeper", error)
>   System.exit(-1)
> }
> {code}
> thus terminating the JVM. A session establishment error shouldn't cause the 
> JVM to terminate.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5032) Think through implications of max.message.size affecting record batches in message format V2

2017-06-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054763#comment-16054763
 ] 

ASF GitHub Bot commented on KAFKA-5032:
---

GitHub user apurvam opened a pull request:

https://github.com/apache/kafka/pull/3374

KAFKA-5032: Update the docs for message size configs across the board

Before 0.11, we used to have limits for maximum message size on the 
producer, broker, and consumer side.

From 0.11 onward, these limits apply to record batches as a whole. This 
patch updates the documentation of the configs to make this explicit. 

A separate patch will have more extensive upgrade notes to tie all the 
changes together in one narrative. 
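
For reference, the producer-side settings the updated docs talk about, with 
comments reflecting the batch-level interpretation (the values shown are just 
the current defaults):

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchSizeConfigs {
    public static Properties producerSizeConfigs() {
        Properties props = new Properties();
        // From 0.11 onward the size check applies to the record batch as a whole,
        // so max.request.size effectively bounds the batch rather than a single message.
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, String.valueOf(1024 * 1024)); // 1 MB default
        // batch.size still controls how many bytes the producer groups per partition
        // before sending; it has to stay below the request/batch limits above.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, String.valueOf(16 * 1024)); // 16 KB default
        return props;
    }
}
{code}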

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apurvam/kafka KAFKA-5032-message-size-docs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3374.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3374


commit b8e1379a54d21141a22694b2aa6d422709bfb89f
Author: Apurva Mehta 
Date:   2017-06-19T21:20:03Z

Change references to 'message' in the size options to 'record batch', since 
everything is written and read in batches in the current version.




> Think through implications of max.message.size affecting record batches in 
> message format V2
> 
>
> Key: KAFKA-5032
> URL: https://issues.apache.org/jira/browse/KAFKA-5032
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Ismael Juma
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: documentation, exactly-once
> Fix For: 0.11.0.0
>
>
> It's worth noting that the new behaviour for uncompressed messages is the 
> same as the existing behaviour for compressed messages.
> A few things to think about:
> 1. Do the producer settings max.request.size and batch.size still make sense 
> and do we need to update the documentation? My conclusion is that things are 
> still fine, but we may need to revise the docs.
> 2. (Seems like we don't need to do this) Consider changing default max 
> message set size to include record batch overhead. This is currently defined 
> as:
> {code}
> val MessageMaxBytes = 100 + MessageSet.LogOverhead
> {code}
> We should consider changing it to (I haven't thought it through though):
> {code}
> val MessageMaxBytes = 100 + DefaultRecordBatch.RECORD_BATCH_OVERHEAD
> {code}
> 3. When a record batch is too large, we throw RecordTooLargeException, which 
> is confusing because there's also a RecordBatchTooLargeException. We should 
> consider renaming these exceptions to make the behaviour clearer.
> 4. We should consider deprecating max.message.bytes (server config) and 
> message.max.bytes (topic config) in favour of configs that make it clear that 
> we are talking about record batches instead of individual messages.
> Part of the work in this JIRA is working out what should be done for 0.11.0.0 
> and what can be done later.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KAFKA-5472) Connector validate REST endpoint returning duplicate entries in "groups"

2017-06-19 Thread Randall Hauch (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Hauch reassigned KAFKA-5472:


 Assignee: Randall Hauch
Affects Version/s: 0.11.0.0
 Priority: Blocker  (was: Major)
  Component/s: KafkaConnect

The logic changed recently as [part of the classloader isolation 
changes|https://github.com/apache/kafka/commit/45f2261763eac5caaebf860daab32ef5337c9293#diff-fad18b064f6a9aa922ad13e6d7e60df3R252],
 and it looks to me like this introduced duplicate validation of the basic 
config definitions that contain properties in the "Common" and "Transform" 
groups. Prior to the classloading feature, the connector-independent (i.e., 
basic) configuration definitions were validated and then the connector-specific 
configurations were validated, and all groups added to the list of group names. 
However, with the classloading feature this logic was changed to validate the 
connector-independent (i.e., basic) configuration first, and then the 
connector-specific configurations are enriched with the basic configuration 
definitions and then all of these are validated. That leads to validating the 
basic configuration definitions twice.

There are several approaches to correct this:
* Simply remove duplicates from the list of all group names. This would still 
lead to the basic configurations being validated twice, but it is simple and 
low-risk.
* Remove [lines 
245-252|https://github.com/apache/kafka/blob/45f2261763eac5caaebf860daab32ef5337c9293/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractHerder.java#L245-L252]
 that validate just the basic config definitions, since the remaining code 
would enrich the connector-specific configuration definitions with the basic 
config defs and validate all of them. This approach is probably more correct, 
as long as we ensure that the basic config defs are validated first.
* Remove the enrichment process so that the second validation step only 
validates the connector-specific configuration properties. This is closest to 
the previous behavior.
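
As a minimal illustration of the first (low-risk) approach, dropping duplicates 
while keeping the first-seen order (the class and method names here are 
illustrative, not the actual AbstractHerder code):

{code:java}
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class GroupNames {
    // Collapses ["Common", "Transforms", "Common", "Transforms"]
    // into ["Common", "Transforms"], preserving first-seen order.
    public static List<String> dedupe(List<String> groups) {
        return new ArrayList<>(new LinkedHashSet<>(groups));
    }
}
{code}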

> Connector validate REST endpoint returning duplicate entries in "groups"
> 
>
> Key: KAFKA-5472
> URL: https://issues.apache.org/jira/browse/KAFKA-5472
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 0.11.0.0
>Reporter: Randall Hauch
>Assignee: Randall Hauch
>Priority: Blocker
>
> When calling the REST endpoint to validate a connector configuration, the 
> response includes duplicate values for the {{groups}} field:
> {code:javascript}
> {
>   // ...
>   groups: ["Common", "Transforms", "Common", "Transforms"]
> }
> {code}
> This is a regression and may cause issues for any tooling that is using the 
> Kafka Connect REST API.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5473) handle ZK session expiration properly when a new session can't be established

2017-06-19 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054757#comment-16054757
 ] 

Jun Rao commented on KAFKA-5473:


The simplest solution is to just fail the broker if the new ZK session can't be 
established.

A more complicated solution is to retry for some time.
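
A rough sketch of the "retry for some time, then fail" idea, using the bare 
ZooKeeper client for illustration (the real broker goes through ZkClient, and the 
attempt count and backoff below are made up):

{code:java}
import java.io.IOException;

import org.apache.zookeeper.ZooKeeper;

public class ZkReconnectPolicy {
    // Try to establish a new ZK session a bounded number of times; if it still
    // fails, exit so the broker does not keep running while unregistered.
    public static ZooKeeper connectOrDie(String connectString, int sessionTimeoutMs, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return new ZooKeeper(connectString, sessionTimeoutMs, event -> { });
            } catch (IOException e) {
                System.err.println("ZK connect attempt " + attempt + " failed: " + e);
                try {
                    Thread.sleep(1000L * attempt); // simple linear backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        // The simpler option from above: fail the broker rather than run unregistered.
        System.exit(-1);
        return null; // unreachable
    }
}
{code}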

> handle ZK session expiration properly when a new session can't be established
> -
>
> Key: KAFKA-5473
> URL: https://issues.apache.org/jira/browse/KAFKA-5473
> Project: Kafka
>  Issue Type: Sub-task
>Affects Versions: 0.9.0.0
>Reporter: Jun Rao
>
> In https://issues.apache.org/jira/browse/KAFKA-2405, we changed the logic in 
> handling ZK session expiration a bit. If a new ZK session can't be 
> established after session expiration, we just log an error and continue. 
> However, this can leave the broker in a bad state since it's up, but not 
> registered from the controller's perspective. Replicas on this broker may 
> never be in sync.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-5473) handle ZK session expiration properly when a new session can't be established

2017-06-19 Thread Jun Rao (JIRA)
Jun Rao created KAFKA-5473:
--

 Summary: handle ZK session expiration properly when a new session 
can't be established
 Key: KAFKA-5473
 URL: https://issues.apache.org/jira/browse/KAFKA-5473
 Project: Kafka
  Issue Type: Sub-task
Affects Versions: 0.9.0.0
Reporter: Jun Rao


In https://issues.apache.org/jira/browse/KAFKA-2405, we changed the logic in 
handling ZK session expiration a bit. If a new ZK session can't be established 
after session expiration, we just log an error and continue. However, this can 
leave the broker in a bad state since it's up, but not registered from the 
controller's perspective. Replicas on this broker may never be in sync.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-5472) Connector validate REST endpoint returning duplicate entries in "groups"

2017-06-19 Thread Randall Hauch (JIRA)
Randall Hauch created KAFKA-5472:


 Summary: Connector validate REST endpoint returning duplicate 
entries in "groups"
 Key: KAFKA-5472
 URL: https://issues.apache.org/jira/browse/KAFKA-5472
 Project: Kafka
  Issue Type: Bug
Reporter: Randall Hauch


When calling the REST endpoint to validate a connector configuration, the 
response includes duplicate values for the {{groups}} field:
{code:javascript}
{
  // ...
  groups: ["Common", "Transforms", "Common", "Transforms"]
}
{code}

This is a regression and may cause issues for any tooling that is using the 
Kafka Connect REST API.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5452) Aggressive log compaction ratio appears to have no negative effect on log-compacted topics

2017-06-19 Thread Jeff Chao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Chao updated KAFKA-5452:
-
Description: 
Some of our users are seeing unintuitive/unexpected behavior with log-compacted 
topics where they receive multiple records for the same key when consuming. 
This is a result of low throughput on log-compacted topics such that conditions 
({{min.cleanable.dirty.ratio = 0.5}}, default) aren't met for compaction to 
kick in.

This prompted us to test and tune {{min.cleanable.dirty.ratio}} in our 
clusters. It appears that having more aggressive log compaction ratios doesn't 
have negative effects on CPU and memory utilization. If this is truly the case, 
we should consider changing the default from {{0.5}} to something more 
aggressive.

Setup:

# 8 brokers
# 5 zk nodes
# 32 partitions on a topic
# replication factor 3
# log roll 3 hours
# log segment bytes 1 GB
# log retention 24 hours
# all messages to a single key
# all messages to a unique key
# all messages to a bounded key range [0, 999]
# {{min.cleanable.dirty.ratio}} per topic = {{0}}, {{0.5}}, and {{1}}
# 200 MB/s sustained, produce and consume traffic

Observations:

We were able to verify log cleaner threads were performing work by checking the 
logs and verifying the {{cleaner-offset-checkpoint}} file for all topics. We 
also observed the log cleaner's {{time-since-last-run-ms}} metric was normal, 
never going above the default of 15 seconds.

Under-replicated partitions stayed steady, same for replication lag.

Here's an example test run where we try out {{min.cleanable.dirty.ratio = 0}}, 
{{min.cleanable.dirty.ratio = 1}}, and {{min.cleanable.dirty.ratio = 0.5}}. 
Troughs in between the peaks represent zero traffic and reconfiguring of topics.

(200mbs-dirty-0-dirty1-dirty05.png attached)
!200mbs-dirty0-dirty-1-dirty05.png|thumbnail!

Memory utilization is fine, but more interestingly, CPU doesn't appear to have 
much difference.

To get more detail, here is a flame graph (raw svg attached) of the run for 
{{min.cleanable.dirty.ratio = 0}}. The conservative and default ratio flame 
graphs are equivalent.

(flame-graph-200mbs-dirty0.png attached)
!flame-graph-200mbs-dirty0.png|thumbnail!

Notice that the majority of CPU is coming from:

# SSL operations (on reads/writes)
# KafkaApis::handleFetchRequest (ReplicaManager::fetchMessages)
# KafkaApis::handleOffsetFetchRequest

We also have examples from small scale test runs which show similar behavior 
but with scaled down CPU usage.

It seems counterintuitive that there's no apparent difference in CPU whether it 
be aggressive or conservative compaction ratios, so we'd like to get some 
thoughts from the community.

We're looking for feedback on whether anyone else has experienced this behavior 
before, or, if CPU isn't affected, whether anyone has seen something related 
instead.

If this is true, then we'd be happy to discuss further and provide a patch.
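
For reproducibility, one way to apply the per-topic ratios listed in the setup 
above is the 0.11 AdminClient's {{alterConfigs}}; the broker address and topic 
name are placeholders, and the tests may well have used kafka-configs.sh instead:

{code:java}
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetDirtyRatio {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "compaction-test"); // placeholder
            // Note: alterConfigs sets the topic's config wholesale, so any other
            // per-topic overrides would need to be included in the same Config.
            Config config = new Config(Collections.singleton(
                new ConfigEntry("min.cleanable.dirty.ratio", "0"))); // also tried: 0.5 and 1
            admin.alterConfigs(Collections.singletonMap(topic, config)).all().get();
        }
    }
}
{code}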

  was:
Some of our users are seeing unintuitive/unexpected behavior with log-compacted 
topics where they receive multiple records for the same key when consuming. 
This is a result of low throughput on log-compacted topics such that conditions 
({{min.cleanable.dirty.ratio = 0.5}}, default) aren't met for compaction to 
kick in.

This prompted us to test and tune {{min.cleanable.dirty.ratio}} in our 
clusters. It appears that having more aggressive log compaction ratios doesn't 
have negative effects on CPU and memory utilization. If this is truly the case, 
we should consider changing the default from {{0.5}} to something more 
aggressive.

Setup:

# 1. 8 brokers
# 2. 5 zk nodes
# 3. 32 partitions on a topic
# 4. replication factor 3
# 5. log roll 3 hours
# 6. log segment bytes 1 GB
# 7. log retention 24 hours
# 8. all messages to a single key
# 9. all messages to a unique key
# 10. all messages to a bounded key range [0, 999]
# 11. {{min.cleanable.dirty.ratio}} per topic = {{0}}, {{0.5}}, and {{1}}
# 12. 200 MB/s sustained, produce and consume traffic

Observations:

We were able to verify log cleaner threads were performing work by checking the 
logs and verifying the {{cleaner-offset-checkpoint}} file for all topics. We 
also observed the log cleaner's {{time-since-last-run-ms}} metric was normal, 
never going above the default of 15 seconds.

Under-replicated partitions stayed steady, same for replication lag.

Here's an example test run where we try out {{min.cleanable.dirty.ratio = 0}}, 
{{min.cleanable.dirty.ratio = 1}}, and {{min.cleanable.dirty.ratio = 0.5}}. 
Troughs in between the peaks represent zero traffic and reconfiguring of topics.

(200mbs-dirty-0-dirty1-dirty05.png attached)
!200mbs-dirty0-dirty-1-dirty05.png|thumbnail!

Memory utilization is fine, but more interestingly, CPU doesn't appear to have 
much difference.

To get more detail, here is a flame graph (raw svg attached) of the run for 
{{min.cleanable.dirty.ratio 

[jira] [Commented] (KAFKA-5431) LogCleaner stopped due to org.apache.kafka.common.errors.CorruptRecordException

2017-06-19 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054686#comment-16054686
 ] 

Jun Rao commented on KAFKA-5431:


[~crietz], in Log.roll(), we call LogSegment.trim() to reset the size of the 
log file to the actual size, which eventually calls FileRecords.truncateTo(). 
If this is reproducible, could you add some instrumentation in 
FileRecords.truncateTo() to see if the logic is actually called during log 
rolling?

> LogCleaner stopped due to 
> org.apache.kafka.common.errors.CorruptRecordException
> ---
>
> Key: KAFKA-5431
> URL: https://issues.apache.org/jira/browse/KAFKA-5431
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.2.1
>Reporter: Carsten Rietz
>  Labels: reliability
> Fix For: 0.11.0.1
>
>
> Hey all,
> I have a strange problem with our UAT cluster of 3 Kafka brokers.
> The __consumer_offsets topic was replicated to two instances and our disks 
> ran full due to a wrong configuration of the log cleaner. We fixed the 
> configuration and updated from 0.10.1.1 to 0.10.2.1.
> Today I increased the replication of the __consumer_offsets topic to 3 and 
> triggered replication to the third broker via kafka-reassign-partitions.sh. 
> That went well, but I get many errors like
> {code}
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for 
> partition [__consumer_offsets,18] offset 0 error Record size is less than the 
> minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for 
> partition [__consumer_offsets,24] offset 0 error Record size is less than the 
> minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> {code}
> Which I think are due to the full disk event.
> The log cleaner threads died on these wrong messages:
> {code}
> [2017-06-12 09:59:50,722] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is less 
> than the minimum record overhead (14)
> [2017-06-12 09:59:50,722] INFO [kafka-log-cleaner-thread-0], Stopped  
> (kafka.log.LogCleaner)
> {code}
> Looking at the files I see that some are truncated and some are just empty:
> $ ls -lsh 00594653.log
> 0 -rw-r--r-- 1 user user 100M Jun 12 11:00 00594653.log
> Sadly I do not have the logs any more from the disk full event itself.
> I have three questions:
> * What is the best way to clean this up? Deleting the old log files and 
> restarting the brokers?
> * Why did Kafka not handle the disk full event well? Is this only affecting 
> the cleanup or may we also lose data?
> * Is this maybe caused by the combination of upgrade and disk full?
> And last but not least: Keep up the good work. Kafka is really performing 
> well while being easy to administer and has good documentation!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KAFKA-5032) Think through implications of max.message.size affecting record batches in message format V2

2017-06-19 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta reassigned KAFKA-5032:
---

Assignee: Apurva Mehta

> Think through implications of max.message.size affecting record batches in 
> message format V2
> 
>
> Key: KAFKA-5032
> URL: https://issues.apache.org/jira/browse/KAFKA-5032
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Ismael Juma
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: documentation, exactly-once
> Fix For: 0.11.0.0
>
>
> It's worth noting that the new behaviour for uncompressed messages is the 
> same as the existing behaviour for compressed messages.
> A few things to think about:
> 1. Do the producer settings max.request.size and batch.size still make sense 
> and do we need to update the documentation? My conclusion is that things are 
> still fine, but we may need to revise the docs.
> 2. (Seems like we don't need to do this) Consider changing default max 
> message set size to include record batch overhead. This is currently defined 
> as:
> {code}
> val MessageMaxBytes = 100 + MessageSet.LogOverhead
> {code}
> We should consider changing it to (I haven't thought it through though):
> {code}
> val MessageMaxBytes = 100 + DefaultRecordBatch.RECORD_BATCH_OVERHEAD
> {code}
> 3. When a record batch is too large, we throw RecordTooLargeException, which 
> is confusing because there's also a RecordBatchTooLargeException. We should 
> consider renaming these exceptions to make the behaviour clearer.
> 4. We should consider deprecating max.message.bytes (server config) and 
> message.max.bytes (topic config) in favour of configs that make it clear that 
> we are talking about record batches instead of individual messages.
> Part of the work in this JIRA is working out what should be done for 0.11.0.0 
> and what can be done later.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5348) kafka-consumer-groups.sh refuses to remove groups without ids

2017-06-19 Thread Vahid Hashemian (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054545#comment-16054545
 ] 

Vahid Hashemian commented on KAFKA-5348:


[~bobrik] Then perhaps ZooKeeper data for this group is somehow corrupted?!
Because normally, even if all consumers in the group die the {{/ids}} path 
would still be there.
Is this something that you can easily reproduce?

> kafka-consumer-groups.sh refuses to remove groups without ids
> -
>
> Key: KAFKA-5348
> URL: https://issues.apache.org/jira/browse/KAFKA-5348
> Project: Kafka
>  Issue Type: Bug
>  Components: admin
>Affects Versions: 0.10.2.0
>Reporter: Ivan Babrou
>Assignee: Vahid Hashemian
>
> In zookeeper I have:
> {noformat}
> [zk: foo(CONNECTED) 37] ls /kafka/logs/consumers/console-consumer-4107
> [offsets]
> {noformat}
> This consumer group also shows up when I list consumer groups:
> {noformat}
> $ /usr/local/kafka/bin/kafka-consumer-groups.sh --zookeeper 
> foo:2181/kafka/logs --list | fgrep console-consumer-4107
> Note: This will only show information about consumers that use ZooKeeper (not 
> those using the Java consumer API).
> console-consumer-4107
> {noformat}
> But I cannot remove this group:
> {noformat}
> $ /usr/local/kafka/bin/kafka-consumer-groups.sh --zookeeper 
> 36zk1.in.pdx.cfdata.org:2181/kafka/logs --delete --group console-consumer-4107
> Note: This will only show information about consumers that use ZooKeeper (not 
> those using the Java consumer API).
> Error: Delete for group 'console-consumer-4107' failed because group does not 
> exist.
> {noformat}
> I ran tcpdump and it turns out that /ids path is checked:
> {noformat}
> $.e.P.fP...&..<...//kafka/logs/consumers/console-consumer-4107/ids.
> {noformat}
> I think kafka should not check for /ids, it should check for / instead here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5465) FetchResponse v0 does not return any messages when max_bytes smaller than v2 message set

2017-06-19 Thread Dana Powers (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054422#comment-16054422
 ] 

Dana Powers commented on KAFKA-5465:


Agree on won't fix.

> FetchResponse v0 does not return any messages when max_bytes smaller than v2 
> message set 
> -
>
> Key: KAFKA-5465
> URL: https://issues.apache.org/jira/browse/KAFKA-5465
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Dana Powers
>Assignee: Jason Gustafson
>Priority: Blocker
> Fix For: 0.11.0.1
>
>
> In prior releases, when consuming uncompressed messages, FetchResponse v0 
> will return a message if it is smaller than the max_bytes sent in the 
> FetchRequest. In 0.11.0.0 RC0, when messages are stored as v2 internally, the 
> response will be empty unless the full MessageSet is smaller than max_bytes. 
> In some configurations, this may cause old consumers to get stuck on large 
> messages where previously they were able to make progress one message at a 
> time.
> For example, when I produce 10 5KB messages using ProduceRequest v0 and then 
> attempt FetchRequest v0 with partition max bytes = 6KB (larger than a single 
> message but smaller than all 10 messages together), I get an empty message 
> set from 0.11.0.0. Previous brokers would have returned a single message.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KAFKA-5465) FetchResponse v0 does not return any messages when max_bytes smaller than v2 message set

2017-06-19 Thread Jason Gustafson (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-5465.

Resolution: Won't Fix

For the reasons mentioned by Ismael above, we've decided not to fix this for 
now. If anyone has concerns, please leave a comment and we will reconsider.

> FetchResponse v0 does not return any messages when max_bytes smaller than v2 
> message set 
> -
>
> Key: KAFKA-5465
> URL: https://issues.apache.org/jira/browse/KAFKA-5465
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Dana Powers
>Assignee: Jason Gustafson
>Priority: Blocker
> Fix For: 0.11.0.1
>
>
> In prior releases, when consuming uncompressed messages, FetchResponse v0 
> will return a message if it is smaller than the max_bytes sent in the 
> FetchRequest. In 0.11.0.0 RC0, when messages are stored as v2 internally, the 
> response will be empty unless the full MessageSet is smaller than max_bytes. 
> In some configurations, this may cause old consumers to get stuck on large 
> messages where previously they were able to make progress one message at a 
> time.
> For example, when I produce 10 5KB messages using ProduceRequest v0 and then 
> attempt FetchRequest v0 with partition max bytes = 6KB (larger than a single 
> message but smaller than all 10 messages together), I get an empty message 
> set from 0.11.0.0. Previous brokers would have returned a single message.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2967) Move Kafka documentation to ReStructuredText

2017-06-19 Thread Tom Bentley (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054289#comment-16054289
 ] 

Tom Bentley commented on KAFKA-2967:


Any progress on this, [~ceposta], [~gwenshap]? I would like to help improve 
the documentation, and not having to edit the raw HTML would make that a nicer 
experience.

> Move Kafka documentation to ReStructuredText
> 
>
> Key: KAFKA-2967
> URL: https://issues.apache.org/jira/browse/KAFKA-2967
> Project: Kafka
>  Issue Type: Bug
>Reporter: Gwen Shapira
>Assignee: Gwen Shapira
>
> Storing documentation as HTML is kind of BS :)
> * Formatting is a pain, and making it look good is even worse
> * It's just HTML, can't generate PDFs
> * Reading and editing is painful
> * Validating changes is hard because our formatting relies on all kinds of 
> Apache Server features.
> I suggest:
> * Move to RST
> * Generate HTML and PDF during build using Sphinx plugin for Gradle.
> Lots of Apache projects are doing this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5412) Using connect-console-sink/source.properties raises an exception related to "file" property not found

2017-06-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054278#comment-16054278
 ] 

ASF GitHub Bot commented on KAFKA-5412:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/3279


> Using connect-console-sink/source.properties raises an exception related to 
> "file" property not found
> -
>
> Key: KAFKA-5412
> URL: https://issues.apache.org/jira/browse/KAFKA-5412
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 0.10.2.1
>Reporter: Paolo Patierno
>Assignee: Paolo Patierno
> Fix For: 0.11.1.0
>
>
> With the latest 0.11.1.0-SNAPSHOT, the Kafka Connect example using 
> connect-console-sink/source.properties doesn't work anymore because the 
> needed "file" property isn't found. This is because the underlying 
> FileStreamSink/Source connector and task define a ConfigDef with "file" 
> as a mandatory parameter. In the case of the console example we want 
> file=null so that stdin and stdout are used. 
> One possible workaround is to set "file=" inside the provided 
> connect-console-sink/source.properties. The other option is to modify the 
> FileStreamSink/Source source code to remove the "file" definition 
> from the ConfigDef.
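
For illustration, the second option amounts to giving the {{file}} key a default 
instead of making it required; a minimal sketch of what such a ConfigDef could look 
like (class and documentation strings are placeholders, not the actual 
FileStreamSource/Sink code):

{code:java}
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigDef.Importance;
import org.apache.kafka.common.config.ConfigDef.Type;

public class FileStreamConfig {
    public static final String FILE_CONFIG = "file";

    // Defining "file" with a null default makes it optional, so the console
    // example can omit it and fall back to stdin/stdout.
    public static final ConfigDef CONFIG_DEF = new ConfigDef()
            .define(FILE_CONFIG, Type.STRING, null, Importance.HIGH,
                    "Source/destination file; if not set, stdin or stdout is used.");
}
{code}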



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-5470) Replace -XX:+DisableExplicitGC with -XX:+ExplicitGCInvokesConcurrent in kafka-run-class

2017-06-19 Thread Hongyuan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054216#comment-16054216
 ] 

Hongyuan Li edited comment on KAFKA-5470 at 6/19/17 3:32 PM:
-

LGTM. That decreases the rate of invoking oom.  
Thanks, [~ijuma].


was (Author: hongyuan li):
LGTM. That decreases the rate of invoking oom.  
Thanks, Ismael Juma.

> Replace -XX:+DisableExplicitGC with -XX:+ExplicitGCInvokesConcurrent in 
> kafka-run-class
> ---
>
> Key: KAFKA-5470
> URL: https://issues.apache.org/jira/browse/KAFKA-5470
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Critical
> Fix For: 0.11.0.1
>
>
> This is important because Bits.reserveMemory calls System.gc() hoping to free 
> native memory in order to avoid throwing an OutOfMemoryException. This call 
> is currently a no-op due to -XX:+DisableExplicitGC.
> It's worth mentioning that -XX:MaxDirectMemorySize can be used to increase 
> the amount of native memory available for allocation of direct byte buffers.
> Hongyuan Li pointed out the issue with the usage of -XX:+DisableExplicitGC 
> in: 
> https://issues.apache.org/jira/browse/KAFKA-5444?focusedCommentId=16054129&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16054129



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5470) Replace -XX:+DisableExplicitGC with -XX:+ExplicitGCInvokesConcurrent in kafka-run-class

2017-06-19 Thread Hongyuan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054216#comment-16054216
 ] 

Hongyuan Li commented on KAFKA-5470:


LGTM. That decreases the rate of invoking oom.  
Thanks, Ismael Juma.

> Replace -XX:+DisableExplicitGC with -XX:+ExplicitGCInvokesConcurrent in 
> kafka-run-class
> ---
>
> Key: KAFKA-5470
> URL: https://issues.apache.org/jira/browse/KAFKA-5470
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Critical
> Fix For: 0.11.0.1
>
>
> This is important because Bits.reserveMemory calls System.gc() hoping to free 
> native memory in order to avoid throwing an OutOfMemoryException. This call 
> is currently a no-op due to -XX:+DisableExplicitGC.
> It's worth mentioning that -XX:MaxDirectMemorySize can be used to increase 
> the amount of native memory available for allocation of direct byte buffers.
> Hongyuan Li pointed out the issue with the usage of -XX:+DisableExplicitGC 
> in: 
> https://issues.apache.org/jira/browse/KAFKA-5444?focusedCommentId=16054129&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16054129



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5471) Original Kafka paper link broken

2017-06-19 Thread Jeremy Hanna (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054211#comment-16054211
 ] 

Jeremy Hanna commented on KAFKA-5471:
-

I can get it done if I can get Confluence access for my user, jeromatron.

> Original Kafka paper link broken
> 
>
> Key: KAFKA-5471
> URL: https://issues.apache.org/jira/browse/KAFKA-5471
> Project: Kafka
>  Issue Type: Bug
>  Components: documentation
>Reporter: Jeremy Hanna
>Priority: Trivial
>
> Currently on the [Kafka papers and presentations 
> site|https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations]
>  the original Kafka paper is linked but it's a broken link.
> Currently it links to 
> [here|http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf]
>  but that person may have taken the paper down.  I found it 
> [here|http://notes.stephenholiday.com/Kafka.pdf] but that could have a 
> similar problem in the future.  We should be able to put the file as an 
> attachment in the confluence wiki to make it a more permanent link.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-5471) Original Kafka paper link broken

2017-06-19 Thread Jeremy Hanna (JIRA)
Jeremy Hanna created KAFKA-5471:
---

 Summary: Original Kafka paper link broken
 Key: KAFKA-5471
 URL: https://issues.apache.org/jira/browse/KAFKA-5471
 Project: Kafka
  Issue Type: Bug
  Components: documentation
Reporter: Jeremy Hanna
Priority: Trivial


Currently on the [Kafka papers and presentations 
site|https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations]
 the original Kafka paper is linked but it's a broken link.

Currently it links to 
[here|http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf]
 but that person may have taken the paper down.  I found it 
[here|http://notes.stephenholiday.com/Kafka.pdf] but that could have a similar 
problem in the future.  We should be able to put the file as an attachment in 
the confluence wiki to make it a more permanent link.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5444) Producer.send() will hang 8+ hours

2017-06-19 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054185#comment-16054185
 ] 

Ismael Juma commented on KAFKA-5444:


That's a good point [~Hongyuan Li]. I filed KAFKA-5470 with a PR that replaces 
-XX:+DisableExplicitGC with -XX:+ExplicitGCInvokesConcurrent.

> Producer.send() will hang 8+ hours
> --
>
> Key: KAFKA-5444
> URL: https://issues.apache.org/jira/browse/KAFKA-5444
> Project: Kafka
>  Issue Type: Bug
>Reporter: Hongyuan Li
>
> Frequently opening and closing the old Kafka Producer can cause the server to hang, 
> with lots of error messages logged in server.log. In our case, we may 
> frequently open and close a Kafka producer; the procedure is like the code below:
> {code}
> Producer producer = ……
> producer.send(List lists);
> producer.close();
> {code}
> the error is below:
> {code}
> 2017-06-13 00:00:00,084] ERROR Processor got uncaught exception. 
> (kafka.network.Processor)
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>   at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:174)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:195)
>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>   at 
> org.apache.kafka.common.network.PlaintextTransportLayer.read(PlaintextTransportLayer.java:110)
>   at 
> org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:97)
>   at 
> org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
>   at 
> org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:154)
>   at 
> org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:135)
>   at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:343)
>   at org.apache.kafka.common.network.Selector.poll(Selector.java:291)
>   at kafka.network.Processor.poll(SocketServer.scala:476)
>   at kafka.network.Processor.run(SocketServer.scala:416)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Seen from all existing logs, all errors are repeats of the error above.
> Any good ideas to solve this?
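
As a point of comparison, the usual recommendation is to keep a single long-lived 
producer and reuse it for all sends rather than opening and closing one per batch; a 
minimal sketch with the Java producer follows (broker address, topic name, and 
serializers are placeholders):

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ReusedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // One producer instance, reused for all sends; closed once on shutdown.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("test-topic", Integer.toString(i), "value-" + i));
            }
        }
    }
}
{code}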



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5470) Replace -XX:+DisableExplicitGC with -XX:+ExplicitGCInvokesConcurrent in kafka-run-class

2017-06-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054179#comment-16054179
 ] 

ASF GitHub Bot commented on KAFKA-5470:
---

GitHub user ijuma opened a pull request:

https://github.com/apache/kafka/pull/3371

KAFKA-5470: Replace -XX:+DisableExplicitGC with 
-XX:+ExplicitGCInvokesConcurrent in kafka-run-class



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ijuma/kafka 
kafka-5470-explicit-gc-invokes-concurrent

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3371.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3371


commit da81116ca8d5c9c75ea447c0febe11f97a67957a
Author: Ismael Juma 
Date:   2017-06-19T15:17:46Z

KAFKA-5470: Replace -XX:+DisableExplicitGC with 
-XX:+ExplicitGCInvokesConcurrent in kafka-run-class




> Replace -XX:+DisableExplicitGC with -XX:+ExplicitGCInvokesConcurrent in 
> kafka-run-class
> ---
>
> Key: KAFKA-5470
> URL: https://issues.apache.org/jira/browse/KAFKA-5470
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Critical
> Fix For: 0.11.0.1
>
>
> This is important because Bits.reserveMemory calls System.gc() hoping to free 
> native memory in order to avoid throwing an OutOfMemoryException. This call 
> is currently a no-op due to -XX:+DisableExplicitGC.
> It's worth mentioning that -XX:MaxDirectMemorySize can be used to increase 
> the amount of native memory available for allocation of direct byte buffers.
> Hongyuan Li pointed out the issue with the usage of -XX:+DisableExplicitGC 
> in: 
> https://issues.apache.org/jira/browse/KAFKA-5444?focusedCommentId=16054129&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16054129



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KAFKA-5470) Replace -XX:+DisableExplicitGC with -XX:+ExplicitGCInvokesConcurrent in kafka-run-class

2017-06-19 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma reassigned KAFKA-5470:
--

 Assignee: Ismael Juma
Fix Version/s: 0.11.0.1

> Replace -XX:+DisableExplicitGC with -XX:+ExplicitGCInvokesConcurrent in 
> kafka-run-class
> ---
>
> Key: KAFKA-5470
> URL: https://issues.apache.org/jira/browse/KAFKA-5470
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Critical
> Fix For: 0.11.0.1
>
>
> This is important because Bits.reserveMemory calls System.gc() hoping to free 
> native memory in order to avoid throwing an OutOfMemoryException. This call 
> is currently a no-op due to -XX:+DisableExplicitGC.
> It's worth mentioning that -XX:MaxDirectMemorySize can be used to increase 
> the amount of native memory available for allocation of direct byte buffers.
> Hongyuan Li pointed out the issue with the usage of -XX:+DisableExplicitGC 
> in: 
> https://issues.apache.org/jira/browse/KAFKA-5444?focusedCommentId=16054129&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16054129



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-5470) Replace -XX:+DisableExplicitGC with -XX:+ExplicitGCInvokesConcurrent in kafka-run-class

2017-06-19 Thread Ismael Juma (JIRA)
Ismael Juma created KAFKA-5470:
--

 Summary: Replace -XX:+DisableExplicitGC with 
-XX:+ExplicitGCInvokesConcurrent in kafka-run-class
 Key: KAFKA-5470
 URL: https://issues.apache.org/jira/browse/KAFKA-5470
 Project: Kafka
  Issue Type: Improvement
Reporter: Ismael Juma
Priority: Critical


This is important because Bits.reserveMemory calls System.gc() hoping to free 
native memory in order to avoid throwing an OutOfMemoryException. This call is 
currently a no-op due to -XX:+DisableExplicitGC.

It's worth mentioning that -XX:MaxDirectMemorySize can be used to increase the 
amount of native memory available for allocation of direct byte buffers.

Hongyuan Li pointed out the issue with the usage of -XX:+DisableExplicitGC in: 
https://issues.apache.org/jira/browse/KAFKA-5444?focusedCommentId=16054129&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16054129
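
To see the mechanism in isolation, here is a tiny stand-alone sketch (not Kafka code): 
run it with {{-XX:MaxDirectMemorySize=64m}} and compare behaviour with and without 
{{-XX:+DisableExplicitGC}}; with the flag set, the System.gc() that Bits.reserveMemory 
issues is a no-op, so the loop typically hits the direct-buffer OutOfMemoryError much 
sooner. The buffer size and limits are arbitrary illustration values.

{code:java}
import java.nio.ByteBuffer;

public class DirectBufferGcDemo {
    public static void main(String[] args) {
        // Allocate 1 MB direct buffers forever, dropping each reference immediately.
        // With -XX:+DisableExplicitGC the explicit GC attempted by Bits.reserveMemory
        // cannot reclaim the unreachable buffers, so this tends to fail fast with
        // "java.lang.OutOfMemoryError: Direct buffer memory".
        long count = 0;
        while (true) {
            ByteBuffer.allocateDirect(1024 * 1024);
            count++;
            if (count % 1024 == 0) {
                System.out.println("Allocated " + count + " buffers so far");
            }
        }
    }
}
{code}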



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5444) Producer.send() will hang 8+ hours

2017-06-19 Thread Hongyuan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054129#comment-16054129
 ] 

Hongyuan Li commented on KAFKA-5444:


I think this problem is associated with direct buffer allocation; using the new 
producer API may encounter a similar problem, which may not hang. I think we 
can solve this in two steps:
1. Remove {{-XX:+DisableExplicitGC}} from the default {{kafka-run-class.sh}}.
2. Use {{sun.misc.Cleaner}} to clean the direct buffer, or use Netty's 
{{PlatformDependent#freeDirectBuffer}} to clean the direct buffer.

Is that right?
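
For reference, step 2 usually means something like the following sketch; it relies on 
pre-Java-9 internal classes ({{sun.nio.ch.DirectBuffer}} / {{sun.misc.Cleaner}}), so it 
is not portable and would need a fallback in real code:

{code:java}
import java.nio.ByteBuffer;

public class DirectBufferCleanup {
    // Frees the native memory behind a direct buffer immediately instead of
    // waiting for GC. Uses JDK-internal classes, so this only compiles and
    // runs on pre-Java-9 JVMs.
    static void freeDirectBuffer(ByteBuffer buffer) {
        if (buffer == null || !buffer.isDirect()) {
            return;
        }
        sun.misc.Cleaner cleaner = ((sun.nio.ch.DirectBuffer) buffer).cleaner();
        if (cleaner != null) {
            cleaner.clean();
        }
    }
}
{code}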

> Producer.send() will hang 8+ hours
> --
>
> Key: KAFKA-5444
> URL: https://issues.apache.org/jira/browse/KAFKA-5444
> Project: Kafka
>  Issue Type: Bug
>Reporter: Hongyuan Li
>
> Frequently opening and closing the old Kafka Producer can cause the server to hang, 
> with lots of error messages logged in server.log. In our case, we may 
> frequently open and close a Kafka producer; the procedure is like the code below:
> {code}
> Producer producer = ……
> producer.send(List lists);
> producer.close();
> {code}
> the error is below:
> {code}
> 2017-06-13 00:00:00,084] ERROR Processor got uncaught exception. 
> (kafka.network.Processor)
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>   at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:174)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:195)
>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>   at 
> org.apache.kafka.common.network.PlaintextTransportLayer.read(PlaintextTransportLayer.java:110)
>   at 
> org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:97)
>   at 
> org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
>   at 
> org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:154)
>   at 
> org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:135)
>   at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:343)
>   at org.apache.kafka.common.network.Selector.poll(Selector.java:291)
>   at kafka.network.Processor.poll(SocketServer.scala:476)
>   at kafka.network.Processor.run(SocketServer.scala:416)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Seen from all existing logs, all errors are repeats of the error above.
> Any good ideas to solve this?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5021) Update Message Delivery Semantics section to take into account KIP-98

2017-06-19 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054046#comment-16054046
 ] 

Ismael Juma commented on KAFKA-5021:


Moving back to 0.11.0.0. This is a doc-only change and does not require a new 
RC.

> Update Message Delivery Semantics section to take into account KIP-98
> -
>
> Key: KAFKA-5021
> URL: https://issues.apache.org/jira/browse/KAFKA-5021
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Ismael Juma
>Assignee: Jason Gustafson
>Priority: Critical
>  Labels: documentation, exactly-once
> Fix For: 0.11.0.0
>
>
> Reference:
> https://kafka.apache.org/documentation/#semantics



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5020) Update protocol documentation to mention message format v2

2017-06-19 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054047#comment-16054047
 ] 

Ismael Juma commented on KAFKA-5020:


Moving back to 0.11.0.0. This is a doc-only change and does not require a new 
RC.

> Update protocol documentation to mention message format v2
> --
>
> Key: KAFKA-5020
> URL: https://issues.apache.org/jira/browse/KAFKA-5020
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Ismael Juma
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: documentation, exactly-once
> Fix For: 0.11.0.0
>
>
> Sections 5.3, 5.4 and 5.5 should be updated:
> https://kafka.apache.org/documentation/#messages
> We may want to mention record batches along with message sets here:
> https://kafka.apache.org/protocol#protocol_message_sets
> And we should update the wiki page linked from the protocol documentation:
> https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5020) Update protocol documentation to mention message format v2

2017-06-19 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-5020:
---
Fix Version/s: (was: 0.11.0.1)
   0.11.0.0

> Update protocol documentation to mention message format v2
> --
>
> Key: KAFKA-5020
> URL: https://issues.apache.org/jira/browse/KAFKA-5020
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Ismael Juma
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: documentation, exactly-once
> Fix For: 0.11.0.0
>
>
> Sections 5.3, 5.4 and 5.5 should be updated:
> https://kafka.apache.org/documentation/#messages
> We may want to mention record batches along with message sets here:
> https://kafka.apache.org/protocol#protocol_message_sets
> And we should update the wiki page linked from the protocol documentation:
> https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5021) Update Message Delivery Semantics section to take into account KIP-98

2017-06-19 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-5021:
---
Fix Version/s: (was: 0.11.0.1)
   0.11.0.0

> Update Message Delivery Semantics section to take into account KIP-98
> -
>
> Key: KAFKA-5021
> URL: https://issues.apache.org/jira/browse/KAFKA-5021
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Ismael Juma
>Assignee: Jason Gustafson
>Priority: Critical
>  Labels: documentation, exactly-once
> Fix For: 0.11.0.0
>
>
> Reference:
> https://kafka.apache.org/documentation/#semantics



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5032) Think through implications of max.message.size affecting record batches in message format V2

2017-06-19 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054045#comment-16054045
 ] 

Ismael Juma commented on KAFKA-5032:


Moving back to 0.11.0.0. This is a doc-only change and does not require a new 
RC.

> Think through implications of max.message.size affecting record batches in 
> message format V2
> 
>
> Key: KAFKA-5032
> URL: https://issues.apache.org/jira/browse/KAFKA-5032
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Ismael Juma
>Priority: Critical
>  Labels: documentation, exactly-once
> Fix For: 0.11.0.0
>
>
> It's worth noting that the new behaviour for uncompressed messages is the 
> same as the existing behaviour for compressed messages.
> A few things to think about:
> 1. Do the producer settings max.request.size and batch.size still make sense 
> and do we need to update the documentation? My conclusion is that things are 
> still fine, but we may need to revise the docs.
> 2. (Seems like we don't need to do this) Consider changing default max 
> message set size to include record batch overhead. This is currently defined 
> as:
> {code}
> val MessageMaxBytes = 100 + MessageSet.LogOverhead
> {code}
> We should consider changing it to (I haven't thought it through though):
> {code}
> val MessageMaxBytes = 100 + DefaultRecordBatch.RECORD_BATCH_OVERHEAD
> {code}
> 3. When a record batch is too large, we throw RecordTooLargeException, which 
> is confusing because there's also a RecordBatchTooLargeException. We should 
> consider renaming these exceptions to make the behaviour clearer.
> 4. We should consider deprecating max.message.bytes (server config) and 
> message.max.bytes (topic config) in favour of configs that make it clear that 
> we are talking about record batches instead of individual messages.
> Part of the work in this JIRA is working out what should be done for 0.11.0.0 
> and what can be done later.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5032) Think through implications of max.message.size affecting record batches in message format V2

2017-06-19 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-5032:
---
Fix Version/s: (was: 0.11.0.1)
   0.11.0.0

> Think through implications of max.message.size affecting record batches in 
> message format V2
> 
>
> Key: KAFKA-5032
> URL: https://issues.apache.org/jira/browse/KAFKA-5032
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Ismael Juma
>Priority: Critical
>  Labels: documentation, exactly-once
> Fix For: 0.11.0.0
>
>
> It's worth noting that the new behaviour for uncompressed messages is the 
> same as the existing behaviour for compressed messages.
> A few things to think about:
> 1. Do the producer settings max.request.size and batch.size still make sense 
> and do we need to update the documentation? My conclusion is that things are 
> still fine, but we may need to revise the docs.
> 2. (Seems like we don't need to do this) Consider changing default max 
> message set size to include record batch overhead. This is currently defined 
> as:
> {code}
> val MessageMaxBytes = 100 + MessageSet.LogOverhead
> {code}
> We should consider changing it to (I haven't thought it through though):
> {code}
> val MessageMaxBytes = 100 + DefaultRecordBatch.RECORD_BATCH_OVERHEAD
> {code}
> 3. When a record batch is too large, we throw RecordTooLargeException, which 
> is confusing because there's also a RecordBatchTooLargeException. We should 
> consider renaming these exceptions to make the behaviour clearer.
> 4. We should consider deprecating max.message.bytes (server config) and 
> message.max.bytes (topic config) in favour of configs that make it clear that 
> we are talking about record batches instead of individual messages.
> Part of the work in this JIRA is working out what should be done for 0.11.0.0 
> and what can be done later.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4985) kafka-acls should resolve dns names and accept ip ranges

2017-06-19 Thread Tom Bentley (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053942#comment-16053942
 ] 

Tom Bentley commented on KAFKA-4985:


> The problem with resolving hostnames client-side is that it would cause a lot 
> of confusion when resolution happened differently client-side versus 
> server-side.

That argument could be applied to practically any use of DNS, so I'm not 
convinced it's a good reason not to do this.

> kafka-acls should resolve dns names and accept ip ranges
> 
>
> Key: KAFKA-4985
> URL: https://issues.apache.org/jira/browse/KAFKA-4985
> Project: Kafka
>  Issue Type: Improvement
>  Components: security
>Reporter: Ryan P
>
> Per KAFKA-2869 it looks like a conscious decision was made to move away from 
> using hostnames for authorization purposes. 
> This is fine; however, IP addresses are terribly inconvenient compared to 
> hostnames with regard to configuring ACLs. 
> I'd like to propose the following two improvements to make managing these 
> ACLs easier for end-users. 
> 1. Allow for simple patterns to be matched, 
> e.g. --allow-host 10.17.81.11[1-9] 
> 2. Allow for hostnames to be used even if they are resolved on the client 
> side. Simple pattern matching on hostnames would be a welcome addition as well,
> e.g. --allow-host host.name.com
> Accepting a comma-delimited list of hostnames and IP addresses would also be 
> helpful.
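
Purely as an illustration of what option 1 could mean (this is not an existing Kafka 
API; the class and method names are invented for the example), a pattern such as 
{{10.17.81.11[1-9]}} could be matched roughly like this:

{code:java}
import java.util.regex.Pattern;

public class HostPatternMatcher {
    // Treats the ACL host pattern as a literal string except for [a-b] style
    // character ranges, e.g. "10.17.81.11[1-9]" matches 10.17.81.111 .. 10.17.81.119.
    // Malformed patterns (e.g. an unclosed '[') are not handled in this sketch.
    static boolean matches(String hostPattern, String hostAddress) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < hostPattern.length(); i++) {
            char c = hostPattern.charAt(i);
            if (c == '[') {
                int end = hostPattern.indexOf(']', i);
                regex.append(hostPattern, i, end + 1); // keep the character class as-is
                i = end;
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), hostAddress);
    }

    public static void main(String[] args) {
        System.out.println(matches("10.17.81.11[1-9]", "10.17.81.115")); // true
        System.out.println(matches("10.17.81.11[1-9]", "10.17.81.110")); // false
    }
}
{code}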



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5458) SASL handshake fails for METADATA requests

2017-06-19 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053938#comment-16053938
 ] 

Ismael Juma commented on KAFKA-5458:


Yes, the client config for JAAS is only supported by the 0.10.2 Java clients.

> SASL handshake fails for METADATA requests
> --
>
> Key: KAFKA-5458
> URL: https://issues.apache.org/jira/browse/KAFKA-5458
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 0.10.1.1
>Reporter: Sagar Lad
> Attachments: kafka_client_jaas.conf, kafka_server_jaas.conf, 
> producer.properties, server.properties
>
>
> I have enabled SASL authentication for my Kafka with 
> {code:java}
> listeners=SASL_PLAINTEXT://0.0.0.0:9092
> advertised.listeners=SASL_PLAINTEXT://0.0.0.0:9092
> sasl.enabled.mechanisms=PLAIN
> security.protocol=SASL_PLAINTEXT
> security.inter.broker.protocol=SASL_PLAINTEXT
> sasl.mechanism=PLAIN
> sasl.mechanism.inter.broker.protocol=PLAIN
> {code}
> but in kafkaServer.out getting
> {code:java}
> [2017-06-16 11:21:12,167] DEBUG Set SASL server state to HANDSHAKE_REQUEST 
> (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator)
> [2017-06-16 11:21:12,167] DEBUG Handle Kafka request METADATA 
> (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator)
> [2017-06-16 11:21:12,167] DEBUG Set SASL server state to FAILED 
> (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator)
> [2017-06-16 11:21:12,167] DEBUG Connection with /127.0.0.1 disconnected 
> (org.apache.kafka.common.network.Selector)
> java.io.IOException: 
> org.apache.kafka.common.errors.IllegalSaslStateException: Unexpected Kafka 
> request of type METADATA during SASL handshake.
> at 
> org.apache.kafka.common.security.authenticator.SaslServerAuthenticator.authenticate(SaslServerAuthenticator.java:243)
> at 
> org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:64)
> at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:338)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:291)
> at kafka.network.Processor.poll(SocketServer.scala:476)
> at kafka.network.Processor.run(SocketServer.scala:416)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> whereas in the client log I observed 
> {code:java}
>  2017-06-16 11:33:33,501 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,607 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,721 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,834 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,952 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:34,059 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:34,162 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
> ###
> {code}
> Metadata requests are being sent every 100ms.
> *Problem*: 
> How can we make the SASL handshake support METADATA-type requests?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5458) SASL handshake fails for METADATA requests

2017-06-19 Thread Sagar Lad (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053934#comment-16053934
 ] 

Sagar Lad commented on KAFKA-5458:
--

Is that supported in Kafka 0.10.2? Because I am using 0.10.1.1.


> SASL handshake fails for METADATA requests
> --
>
> Key: KAFKA-5458
> URL: https://issues.apache.org/jira/browse/KAFKA-5458
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 0.10.1.1
>Reporter: Sagar Lad
> Attachments: kafka_client_jaas.conf, kafka_server_jaas.conf, 
> producer.properties, server.properties
>
>
> I have enabled SASL authentication for my Kafka with 
> {code:java}
> listeners=SASL_PLAINTEXT://0.0.0.0:9092
> advertised.listeners=SASL_PLAINTEXT://0.0.0.0:9092
> sasl.enabled.mechanisms=PLAIN
> security.protocol=SASL_PLAINTEXT
> security.inter.broker.protocol=SASL_PLAINTEXT
> sasl.mechanism=PLAIN
> sasl.mechanism.inter.broker.protocol=PLAIN
> {code}
> but in kafkaServer.out getting
> {code:java}
> [2017-06-16 11:21:12,167] DEBUG Set SASL server state to HANDSHAKE_REQUEST 
> (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator)
> [2017-06-16 11:21:12,167] DEBUG Handle Kafka request METADATA 
> (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator)
> [2017-06-16 11:21:12,167] DEBUG Set SASL server state to FAILED 
> (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator)
> [2017-06-16 11:21:12,167] DEBUG Connection with /127.0.0.1 disconnected 
> (org.apache.kafka.common.network.Selector)
> java.io.IOException: 
> org.apache.kafka.common.errors.IllegalSaslStateException: Unexpected Kafka 
> request of type METADATA during SASL handshake.
> at 
> org.apache.kafka.common.security.authenticator.SaslServerAuthenticator.authenticate(SaslServerAuthenticator.java:243)
> at 
> org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:64)
> at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:338)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:291)
> at kafka.network.Processor.poll(SocketServer.scala:476)
> at kafka.network.Processor.run(SocketServer.scala:416)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> whereas in the client log I observed 
> {code:java}
>  2017-06-16 11:33:33,501 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,607 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,721 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,834 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,952 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:34,059 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:34,162 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
> ###
> {code}
> Metadata requests are being sent every 100ms.
> *Problem*: 
> How can we make the SASL handshake support METADATA-type requests?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5431) LogCleaner stopped due to org.apache.kafka.common.errors.CorruptRecordException

2017-06-19 Thread Carsten Rietz (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053936#comment-16053936
 ] 

Carsten Rietz commented on KAFKA-5431:
--

Yes, the preallocation size is 100MB. I set the roll time to 1s to provoke many 
log rolls and more easily trigger this problem; therefore only one offset makes 
it into each new log. The segment size is normal (log.segment.bytes=10485760).

> LogCleaner stopped due to 
> org.apache.kafka.common.errors.CorruptRecordException
> ---
>
> Key: KAFKA-5431
> URL: https://issues.apache.org/jira/browse/KAFKA-5431
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.2.1
>Reporter: Carsten Rietz
>  Labels: reliability
> Fix For: 0.11.0.1
>
>
> Hey all,
> I have a strange problem with our UAT cluster of 3 Kafka brokers.
> The __consumer_offsets topic was replicated to two instances and our disks 
> ran full due to a wrong configuration of the log cleaner. We fixed the 
> configuration and updated from 0.10.1.1 to 0.10.2.1.
> Today I increased the replication of the __consumer_offsets topic to 3 and 
> triggered replication to the third broker via kafka-reassign-partitions.sh. 
> That went well, but I get many errors like
> {code}
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for 
> partition [__consumer_offsets,18] offset 0 error Record size is less than the 
> minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for 
> partition [__consumer_offsets,24] offset 0 error Record size is less than the 
> minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> {code}
> Which I think are due to the full disk event.
> The log cleaner threads died on these wrong messages:
> {code}
> [2017-06-12 09:59:50,722] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is less 
> than the minimum record overhead (14)
> [2017-06-12 09:59:50,722] INFO [kafka-log-cleaner-thread-0], Stopped  
> (kafka.log.LogCleaner)
> {code}
> Looking at the files I see that some are truncated and some are just empty:
> $ ls -lsh 00594653.log
> 0 -rw-r--r-- 1 user user 100M Jun 12 11:00 00594653.log
> Sadly I do not have the logs any more from the disk full event itself.
> I have three questions:
> * What is the best way to clean this up? Deleting the old log files and 
> restarting the brokers?
> * Why did Kafka not handle the disk full event well? Is this only affecting 
> the cleanup or may we also lose data?
> * Is this maybe caused by the combination of upgrade and disk full?
> And last but not least: Keep up the good work. Kafka is really performing 
> well while being easy to administer and has good documentation!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5458) SASL handshake fails for METADATA requests

2017-06-19 Thread Edoardo Comar (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053896#comment-16053896
 ] 

Edoardo Comar commented on KAFKA-5458:
--

It looks like your properties file does not correctly set the 
{{sasl.jaas.config}} property. 
Note that when using that property you must not set the JVM system property.
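
For clarity, these are the two ways the client-side JAAS configuration can be supplied; 
this is a sketch with placeholder credentials and paths, and per the comments above the 
{{sasl.jaas.config}} property requires 0.10.2+ Java clients, while 0.10.1.1 needs the 
{{java.security.auth.login.config}} system property pointing at a JAAS file with a 
KafkaClient section:

{code:java}
import java.util.Properties;

public class SaslClientConfigExample {
    static Properties producerSecurityProps() {
        Properties props = new Properties();
        props.put("security.protocol", "SASL_PLAINTEXT");
        props.put("sasl.mechanism", "PLAIN");
        // Option 1 (0.10.2+ clients only): inline JAAS config in the client properties.
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"alice\" password=\"alice-secret\";");
        return props;
    }

    public static void main(String[] args) {
        // Option 2 (works on 0.10.1.1): point the JVM at a JAAS file that contains
        // a KafkaClient section, instead of setting sasl.jaas.config above.
        // System.setProperty("java.security.auth.login.config", "/path/to/kafka_client_jaas.conf");
        System.out.println(producerSecurityProps());
    }
}
{code}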

> SASL handshake fails for METADATA requests
> --
>
> Key: KAFKA-5458
> URL: https://issues.apache.org/jira/browse/KAFKA-5458
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 0.10.1.1
>Reporter: Sagar Lad
> Attachments: kafka_client_jaas.conf, kafka_server_jaas.conf, 
> producer.properties, server.properties
>
>
> I have enabled SASL authentication for my Kafka with 
> {code:java}
> listeners=SASL_PLAINTEXT://0.0.0.0:9092
> advertised.listeners=SASL_PLAINTEXT://0.0.0.0:9092
> sasl.enabled.mechanisms=PLAIN
> security.protocol=SASL_PLAINTEXT
> security.inter.broker.protocol=SASL_PLAINTEXT
> sasl.mechanism=PLAIN
> sasl.mechanism.inter.broker.protocol=PLAIN
> {code}
> but in kafkaServer.out getting
> {code:java}
> [2017-06-16 11:21:12,167] DEBUG Set SASL server state to HANDSHAKE_REQUEST 
> (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator)
> [2017-06-16 11:21:12,167] DEBUG Handle Kafka request METADATA 
> (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator)
> [2017-06-16 11:21:12,167] DEBUG Set SASL server state to FAILED 
> (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator)
> [2017-06-16 11:21:12,167] DEBUG Connection with /127.0.0.1 disconnected 
> (org.apache.kafka.common.network.Selector)
> java.io.IOException: 
> org.apache.kafka.common.errors.IllegalSaslStateException: Unexpected Kafka 
> request of type METADATA during SASL handshake.
> at 
> org.apache.kafka.common.security.authenticator.SaslServerAuthenticator.authenticate(SaslServerAuthenticator.java:243)
> at 
> org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:64)
> at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:338)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:291)
> at kafka.network.Processor.poll(SocketServer.scala:476)
> at kafka.network.Processor.run(SocketServer.scala:416)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> whereas in the client log I observed 
> {code:java}
>  2017-06-16 11:33:33,501 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,607 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,721 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,834 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,952 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:34,059 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:34,162 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
> ###
> {code}
> Metadata requests are being sent every 100ms.
> *Problem*: 
> How can we make the SASL handshake support METADATA-type requests?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-5458) SASL handshake fails for METADATA requests

2017-06-19 Thread Sagar Lad (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053452#comment-16053452
 ] 

Sagar Lad edited comment on KAFKA-5458 at 6/19/17 12:14 PM:


Hi [~ecomar],
I tried using a properties file as well instead of jaas.conf, but now I am getting:

{code:java}
Caused by: org.apache.kafka.common.KafkaException: 
java.lang.IllegalArgumentException: Could not find a 'KafkaClient' entry in the 
JAAS configuration. System property 'java.security.auth.login.config' is not set
at 
org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:86)[352:wrap_mvn_org.apache.kafka_kafka-clients_0.10.1.1:0.0.0]
at 
org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:70)[352:wrap_mvn_org.apache.kafka_kafka-clients_0.10.1.1:0.0.0]
at 
org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:85)[352:wrap_mvn_org.apache.kafka_kafka-clients_0.10.1.1:0.0.0]
at 
org.apache.kafka.clients.producer.KafkaProducer.(KafkaProducer.java:302)[352:wrap_mvn_org.apache.kafka_kafka-clients_0.10.1.1:0.0.0]
... 74 more
{code}





was (Author: sagar lad):
i tried using properties files as well instead of jaas.conf but now getting :

{code:java}
Caused by: org.apache.kafka.common.KafkaException: 
java.lang.IllegalArgumentException: Could not find a 'KafkaClient' entry in the 
JAAS configuration. System property 'java.security.auth.login.config' is not set
at 
org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:86)[352:wrap_mvn_org.apache.kafka_kafka-clients_0.10.1.1:0.0.0]
at 
org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:70)[352:wrap_mvn_org.apache.kafka_kafka-clients_0.10.1.1:0.0.0]
at 
org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:85)[352:wrap_mvn_org.apache.kafka_kafka-clients_0.10.1.1:0.0.0]
at 
org.apache.kafka.clients.producer.KafkaProducer.(KafkaProducer.java:302)[352:wrap_mvn_org.apache.kafka_kafka-clients_0.10.1.1:0.0.0]
... 74 more
{code}




> SASL handshake fails for METADATA requests
> --
>
> Key: KAFKA-5458
> URL: https://issues.apache.org/jira/browse/KAFKA-5458
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 0.10.1.1
>Reporter: Sagar Lad
> Attachments: kafka_client_jaas.conf, kafka_server_jaas.conf, 
> producer.properties, server.properties
>
>
> I have enabled SASL authentication for my Kafka with 
> {code:java}
> listeners=SASL_PLAINTEXT://0.0.0.0:9092
> advertised.listeners=SASL_PLAINTEXT://0.0.0.0:9092
> sasl.enabled.mechanisms=PLAIN
> security.protocol=SASL_PLAINTEXT
> security.inter.broker.protocol=SASL_PLAINTEXT
> sasl.mechanism=PLAIN
> sasl.mechanism.inter.broker.protocol=PLAIN
> {code}
> but in kafkaServer.out getting
> {code:java}
> [2017-06-16 11:21:12,167] DEBUG Set SASL server state to HANDSHAKE_REQUEST 
> (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator)
> [2017-06-16 11:21:12,167] DEBUG Handle Kafka request METADATA 
> (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator)
> [2017-06-16 11:21:12,167] DEBUG Set SASL server state to FAILED 
> (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator)
> [2017-06-16 11:21:12,167] DEBUG Connection with /127.0.0.1 disconnected 
> (org.apache.kafka.common.network.Selector)
> java.io.IOException: 
> org.apache.kafka.common.errors.IllegalSaslStateException: Unexpected Kafka 
> request of type METADATA during SASL handshake.
> at 
> org.apache.kafka.common.security.authenticator.SaslServerAuthenticator.authenticate(SaslServerAuthenticator.java:243)
> at 
> org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:64)
> at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:338)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:291)
> at kafka.network.Processor.poll(SocketServer.scala:476)
> at kafka.network.Processor.run(SocketServer.scala:416)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> whereas in the client log I observed 
> {code:java}
>  2017-06-16 11:33:33,501 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,607 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:568)   
>   - Bootstrap broker 127.0.0.1:9092 disconnected
>  2017-06-16 11:33:33,721 WARN  [kafka-producer-network-thread | 
> producer-7] NetworkClient.maybeHandleDisconnection (NetworkClient.java:
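For reference, the broker-side "Unexpected Kafka request of type METADATA during SASL 
handshake" together with the repeated bootstrap disconnects above is the pattern 
typically seen when the producer connects to the SASL_PLAINTEXT listener without 
itself being configured for SASL. A sketch of the client-side settings such a listener 
expects (mirroring the broker config above; the attached producer.properties may 
already contain them):

{code}
# producer.properties -- client-side SASL settings matching the SASL_PLAINTEXT listener
bootstrap.servers=127.0.0.1:9092
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
{code}

In addition, the producer JVM still needs -Djava.security.auth.login.config pointing at 
a JAAS file with a KafkaClient entry, as sketched earlier.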

[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

2017-06-19 Thread Pengwei (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053852#comment-16053852
 ] 

Pengwei commented on KAFKA-4900:


[~ijuma] OK. I see that in the trunk code, after a send fails it calls 
handleIllegalState, which calls onControllerResignation() before setting 
the leaderId to -1, so it seems this PR can fix this issue.
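
A toy model of the ordering being described, not the literal trunk code (only the 
method names follow the comment; everything else is illustrative): cleanup runs while 
the elector still believes it is the leader, so the channel manager and its metrics are 
torn down before any re-election can begin.

{code:java}
// Toy model of the resignation ordering described in the comment above.
// Not Kafka code: class and field names are illustrative only.
class ToyControllerElector {
    private volatile int leaderId = 1;          // this broker's id; -1 means "not the leader"
    private boolean channelManagerRunning = true;

    // stands in for KafkaController.onControllerResignation()
    private void onControllerResignation() {
        channelManagerRunning = false;          // close send threads, deregister metrics
    }

    // ordering used by the fix, per the comment: clean up first, then resign
    void handleIllegalState() {
        onControllerResignation();
        leaderId = -1;                          // only now does handleDataDeleted see "not me"
    }
}
{code}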

> Brokers stuck in controller re-election loop after failing to register metrics
> --
>
> Key: KAFKA-4900
> URL: https://issues.apache.org/jira/browse/KAFKA-4900
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, core
>Affects Versions: 0.10.1.1
>Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception 
> continues to occur on all three nodes.
> {code}
> 2017-03-15 02:17:30,677 ERROR 
> [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> server.ZookeeperLeaderElector - Error while electing or becoming leader on 
> broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName 
> [name=connection-close-rate, 
> group=controller-channel-metrics,description=Connections closed per second in 
> the window., tags={broker-id=10}]' already exists, can't register another one.
> at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
> at 
> org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
> at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
> at 
> kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
> at 
> kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
> at 
> kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
> at 
> kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
> at 
> kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
> at 
> kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
> at 
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
> at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) 
> {code}
> We observe a tight loop of controller (re-)election, i.e. one node hits this 
> exception, and leadership transitions to the next, which then hits the 
> exception, ad infinitum.
> Producers and consumers appear to be connecting ok, and are able to produce 
> and consume messages.
> Relevant data points:
>  - prior to this cluster restart, a partition reassignment was attempted for 
> a number of topics, which appeared to get stuck in the "in progress" state 
> (on the order of days)
>  - these topics were subsequently deleted
>  - a rolling restart of the cluster was performed to turn on 
> broker-to-broker SSL communication
>  - the SSL change has subsequently been _rolled back_ after we observed these 
> exceptions
>  - the entire cluster was shut down, and nodes brought back one at a time in 
> an attempt to clear the exception. We were able to restart the cluster, but 
> we continue to see the exceptions
> We also observed, during the same time as the exception above, the following 
> exception on all hosts:
> {code}
> 2017-03-15 01:44:04,572 ERROR 
> [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener 
> on Controller 10]: Error while handling broker changes
> java.lang.ClassCastException: java

[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

2017-06-19 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053830#comment-16053830
 ] 

Ismael Juma commented on KAFKA-4900:


No, #2935 is only in the trunk and 0.11.0 branches and it's not for KAFKA-3761 
(it did remove something that had been left behind by KAFKA-3761).

> Brokers stuck in controller re-election loop after failing to register metrics
> --
>
> Key: KAFKA-4900
> URL: https://issues.apache.org/jira/browse/KAFKA-4900
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, core
>Affects Versions: 0.10.1.1
>Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception 
> continues to occur on all three nodes.
> {code}
> 2017-03-15 02:17:30,677 ERROR 
> [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> server.ZookeeperLeaderElector - Error while electing or becoming leader on 
> broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName 
> [name=connection-close-rate, 
> group=controller-channel-metrics,description=Connections closed per second in 
> the window., tags={broker-id=10}]' already exists, can't register another one.
> at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
> at 
> org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
> at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
> at 
> kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
> at 
> kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
> at 
> kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
> at 
> kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
> at 
> kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
> at 
> kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
> at 
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
> at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) 
> {code}
> We observe a tight loop of controller (re-)election, i.e. one node hits this 
> exception, and leadership transitions to the next, which then hits the 
> exception, ad infinitum.
> Producers and consumers appear to be connecting ok, and are able to produce 
> and consume messages.
> Relevant data points:
>  - prior to this cluster restart, a partition reassignment was attempted for 
> a number of topics, which appeared to get stuck in the "in progress" state 
> (on the order of days)
>  - these topics were subsequently deleted
>  - a rolling restart of the cluster was performed to turn on 
> broker-to-broker SSL communication
>  - the SSL change has subsequently been _rolled back_ after we observed these 
> exceptions
>  - the entire cluster was shut down, and nodes brought back one at a time in 
> an attempt to clear the exception. We were able to restart the cluster, but 
> we continue to see the exceptions
> We also observed, during the same time as the exception above, the following 
> exception on all hosts:
> {code}
> 2017-03-15 01:44:04,572 ERROR 
> [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener 
> on Controller 10]: Error while handling broker changes
> java.lang.ClassCastException: java.lang.String cannot be cast to 
> java.lang.Integer
> a

[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

2017-06-19 Thread Pengwei (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053813#comment-16053813
 ] 

Pengwei commented on KAFKA-4900:


[~ijuma] That PR is for https://issues.apache.org/jira/browse/KAFKA-3761, 
and KAFKA-3761 is fixed in 0.10.1.0.
But I found that in 0.10.2.0, ZookeeperLeaderElector.resign still 
sets the leaderId to -1.

> Brokers stuck in controller re-election loop after failing to register metrics
> --
>
> Key: KAFKA-4900
> URL: https://issues.apache.org/jira/browse/KAFKA-4900
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, core
>Affects Versions: 0.10.1.1
>Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception 
> continues to occur on all three nodes.
> {code}
> 2017-03-15 02:17:30,677 ERROR 
> [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> server.ZookeeperLeaderElector - Error while electing or becoming leader on 
> broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName 
> [name=connection-close-rate, 
> group=controller-channel-metrics,description=Connections closed per second in 
> the window., tags={broker-id=10}]' already exists, can't register another one.
> at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
> at 
> org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
> at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
> at 
> kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
> at 
> kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
> at 
> kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
> at 
> kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
> at 
> kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
> at 
> kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
> at 
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
> at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) 
> {code}
> We observe a tight loop of controller (re-)election, i.e. one node hits this 
> exception, and leadership transitions to the next, which then hits the 
> exception, ad infinitum.
> Producers and consumers appear to be connecting ok, and are able to produce 
> and consume messages.
> Relevant data points:
>  - prior to this cluster restart, a partition reassignment was attempted for 
> a number of topics, which appeared to get stuck in the "in progress" state 
> (on the order of days)
>  - these topics were subsequently deleted
>  - a rolling restart of the cluster was performed to turn on 
> broker-to-broker SSL communication
>  - the SSL change has subsequently been _rolled back_ after we observed these 
> exceptions
>  - the entire cluster was shut down, and nodes brought back one at a time in 
> an attempt to clear the exception. We were able to restart the cluster, but 
> we continue to see the exceptions
> We also observed, during the same time as the exception above, the following 
> exception on all hosts:
> {code}
> 2017-03-15 01:44:04,572 ERROR 
> [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener 
> on Controller 10]: Error while handling broker changes
> java.lang.ClassCastException:

[jira] [Commented] (KAFKA-5469) Created state changelog topics not logged correctly

2017-06-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053801#comment-16053801
 ] 

ASF GitHub Bot commented on KAFKA-5469:
---

GitHub user ppatierno opened a pull request:

https://github.com/apache/kafka/pull/3368

KAFKA-5469: Created state changelog topics not logged correctly

Fixed debug logging for the created state changelog topics.
Added toString() to InternalTopicMetadata and InternalTopicConfig for the 
above debug logging.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppatierno/kafka kafka-5469

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3368.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3368


commit 33d1ce3abdd99c98a8d878d266849347d1f20807
Author: ppatierno 
Date:   2017-06-19T10:56:35Z

Fixed debug logging for the created state changelog topics
Added toString() for InternalTopicMetadata and InternalTopicConfig for 
above debug logging




> Created state changelog topics not logged correctly
> ---
>
> Key: KAFKA-5469
> URL: https://issues.apache.org/jira/browse/KAFKA-5469
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Paolo Patierno
>Assignee: Paolo Patierno
>
> In the StreamPartitionAssignor class, the created state changelog topics 
> aren't logged correctly if the DEBUG log level is set.
> {code}
> [2017-06-19 12:18:44,186] DEBUG stream-thread 
> [streams-temperature-6d25c7ff-0927-4469-8ab3-2685e7d8-StreamThread-1] 
> Created state changelog topics 
> {streams-temperature-KSTREAM-REDUCE-STATE-STORE-02-changelog=org.apache.kafka.streams.processor.internals.StreamPartitionAssignor$InternalTopicMetadata@65ae7693}
>  from the parsed topology. 
> (org.apache.kafka.streams.processor.internals.StreamPartitionAssignor:477)
> {code}
> whereas the repartition topics are logged correctly:
> {code}
> [2017-06-19 12:18:37,871] DEBUG stream-thread 
> [streams-temperature-6d25c7ff-0927-4469-8ab3-2685e7d8-StreamThread-1] 
> Created repartition topics [Partition(topic = 
> streams-temperature-KSTREAM-REDUCE-STATE-STORE-02-repartition, 
> partition = 0, leader = none, replicas = [], isr = [])] from the parsed 
> topology. 
> (org.apache.kafka.streams.processor.internals.StreamPartitionAssignor:402)
> {code}
> At the same time, if the source topics are not created before launching the streams 
> application, the state changelog topics log shows just {} (the placeholder), 
> while for repartition topics it shows [], which is right because it's an empty list.
> It seems that there are two problems: for state changelog topics, values() is not 
> used, and InternalTopicMetadata doesn't have a toString() to produce well-formatted 
> output.
> I'm already working on that.
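
A sketch of the two fixes the description points at, assuming the wrapper holds a topic 
name and a partition count (the real InternalTopicMetadata fields may differ): give the 
wrapper a readable toString() and log the map's values() rather than the map of 
wrappers.

{code:java}
// Illustrative only -- the field names are assumptions, not the actual class members.
class InternalTopicMetadataSketch {
    private final String topicName;
    private final int numPartitions;

    InternalTopicMetadataSketch(final String topicName, final int numPartitions) {
        this.topicName = topicName;
        this.numPartitions = numPartitions;
    }

    @Override
    public String toString() {
        // without an override the debug log prints ClassName@hashCode, as in the report
        return "InternalTopicMetadata(topic=" + topicName + ", partitions=" + numPartitions + ")";
    }
}
{code}

With that in place the assignor could log the collection, e.g. 
log.debug("Created state changelog topics {} from the parsed topology.", changelogTopicMetadata.values()), 
and an empty map would print as [] just like the repartition case.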



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5469) Created state changelog topics not logged correctly

2017-06-19 Thread Paolo Patierno (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paolo Patierno updated KAFKA-5469:
--
Description: 
In the StreamPartitionAssignor class, the created state changelog topics aren't 
logged correctly if the DEBUG log level is set.

{code}
[2017-06-19 12:18:44,186] DEBUG stream-thread 
[streams-temperature-6d25c7ff-0927-4469-8ab3-2685e7d8-StreamThread-1] 
Created state changelog topics 
{streams-temperature-KSTREAM-REDUCE-STATE-STORE-02-changelog=org.apache.kafka.streams.processor.internals.StreamPartitionAssignor$InternalTopicMetadata@65ae7693}
 from the parsed topology. 
(org.apache.kafka.streams.processor.internals.StreamPartitionAssignor:477)
{code}

whereas the repartition topics are logged correctly:

{code}
[2017-06-19 12:18:37,871] DEBUG stream-thread 
[streams-temperature-6d25c7ff-0927-4469-8ab3-2685e7d8-StreamThread-1] 
Created repartition topics [Partition(topic = 
streams-temperature-KSTREAM-REDUCE-STATE-STORE-02-repartition, 
partition = 0, leader = none, replicas = [], isr = [])] from the parsed 
topology. 
(org.apache.kafka.streams.processor.internals.StreamPartitionAssignor:402)
{code}

At the same time, if the source topics are not created before launching the streams 
application, the state changelog topics log shows just {} (the placeholder), while 
for repartition topics it shows [], which is right because it's an empty list.

It seems that there are two problems: for state changelog topics, values() is not 
used, and InternalTopicMetadata doesn't have a toString() to produce well-formatted 
output.
I'm already working on that.

  was:
In the StreamPartitionAssignor class, the created state changelog topics aren't 
logged correctly if the DEBUG log level is set.

[2017-06-19 12:18:44,186] DEBUG stream-thread 
[streams-temperature-6d25c7ff-0927-4469-8ab3-2685e7d8-StreamThread-1] 
Created state changelog topics 
{streams-temperature-KSTREAM-REDUCE-STATE-STORE-02-changelog=org.apache.kafka.streams.processor.internals.StreamPartitionAssignor$InternalTopicMetadata@65ae7693}
 from the parsed topology. 
(org.apache.kafka.streams.processor.internals.StreamPartitionAssignor:477)

whereas the repartition topics are logged correctly:

[2017-06-19 12:18:37,871] DEBUG stream-thread 
[streams-temperature-6d25c7ff-0927-4469-8ab3-2685e7d8-StreamThread-1] 
Created repartition topics [Partition(topic = 
streams-temperature-KSTREAM-REDUCE-STATE-STORE-02-repartition, 
partition = 0, leader = none, replicas = [], isr = [])] from the parsed 
topology. 
(org.apache.kafka.streams.processor.internals.StreamPartitionAssignor:402)

At the same time, if the source topics are not created before launching the streams 
application, the state changelog topics log shows just {} (the placeholder), while 
for repartition topics it shows [], which is right because it's an empty list.

It seems that there are two problems: for state changelog topics, values() is not 
used, and InternalTopicMetadata doesn't have a toString() to produce well-formatted 
output.
I'm already working on that.


> Created state changelog topics not logged correctly
> ---
>
> Key: KAFKA-5469
> URL: https://issues.apache.org/jira/browse/KAFKA-5469
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Paolo Patierno
>Assignee: Paolo Patierno
>
> In the StreamPartitionAssignor class, the created state changelog topics 
> aren't logged correctly if the DEBUG log level is set.
> {code}
> [2017-06-19 12:18:44,186] DEBUG stream-thread 
> [streams-temperature-6d25c7ff-0927-4469-8ab3-2685e7d8-StreamThread-1] 
> Created state changelog topics 
> {streams-temperature-KSTREAM-REDUCE-STATE-STORE-02-changelog=org.apache.kafka.streams.processor.internals.StreamPartitionAssignor$InternalTopicMetadata@65ae7693}
>  from the parsed topology. 
> (org.apache.kafka.streams.processor.internals.StreamPartitionAssignor:477)
> {code}
> whereas the repartition topics are logged correctly:
> {code}
> [2017-06-19 12:18:37,871] DEBUG stream-thread 
> [streams-temperature-6d25c7ff-0927-4469-8ab3-2685e7d8-StreamThread-1] 
> Created repartition topics [Partition(topic = 
> streams-temperature-KSTREAM-REDUCE-STATE-STORE-02-repartition, 
> partition = 0, leader = none, replicas = [], isr = [])] from the parsed 
> topology. 
> (org.apache.kafka.streams.processor.internals.StreamPartitionAssignor:402)
> {code}
> At the same time, if the source topics are not created before launching the streams 
> application, the state changelog topics log shows just {} (the placeholder), 
> while for repartition topics it shows [], which is right because it's an empty list.
> It seems that there are two problems. For state changelog topics the values() 
> is not used but then InternalTopicMetadata hasn't a to

[jira] [Created] (KAFKA-5469) Created state changelog topics not logged correctly

2017-06-19 Thread Paolo Patierno (JIRA)
Paolo Patierno created KAFKA-5469:
-

 Summary: Created state changelog topics not logged correctly
 Key: KAFKA-5469
 URL: https://issues.apache.org/jira/browse/KAFKA-5469
 Project: Kafka
  Issue Type: Bug
  Components: streams
Reporter: Paolo Patierno
Assignee: Paolo Patierno


In the StreamPartitionAssignor class, the created state changelog topics aren't 
logged correctly if the DEBUG log level is set.

[2017-06-19 12:18:44,186] DEBUG stream-thread 
[streams-temperature-6d25c7ff-0927-4469-8ab3-2685e7d8-StreamThread-1] 
Created state changelog topics 
{streams-temperature-KSTREAM-REDUCE-STATE-STORE-02-changelog=org.apache.kafka.streams.processor.internals.StreamPartitionAssignor$InternalTopicMetadata@65ae7693}
 from the parsed topology. 
(org.apache.kafka.streams.processor.internals.StreamPartitionAssignor:477)

whereas the repartition topics are logged correctly:

[2017-06-19 12:18:37,871] DEBUG stream-thread 
[streams-temperature-6d25c7ff-0927-4469-8ab3-2685e7d8-StreamThread-1] 
Created repartition topics [Partition(topic = 
streams-temperature-KSTREAM-REDUCE-STATE-STORE-02-repartition, 
partition = 0, leader = none, replicas = [], isr = [])] from the parsed 
topology. 
(org.apache.kafka.streams.processor.internals.StreamPartitionAssignor:402)

At the same time, if the source topics are not created before launching the streams 
application, the state changelog topics log shows just {} (the placeholder), while 
for repartition topics it shows [], which is right because it's an empty list.

It seems that there are two problems: for state changelog topics, values() is not 
used, and InternalTopicMetadata doesn't have a toString() to produce well-formatted 
output.
I'm already working on that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

2017-06-19 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053750#comment-16053750
 ] 

Ismael Juma commented on KAFKA-4900:


[~pengwei], can you please double-check that the following PR has not fixed the 
issue you are describing?

https://github.com/apache/kafka/pull/2935

> Brokers stuck in controller re-election loop after failing to register metrics
> --
>
> Key: KAFKA-4900
> URL: https://issues.apache.org/jira/browse/KAFKA-4900
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, core
>Affects Versions: 0.10.1.1
>Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception 
> continues to occur on all three nodes.
> {code}
> 2017-03-15 02:17:30,677 ERROR 
> [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> server.ZookeeperLeaderElector - Error while electing or becoming leader on 
> broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName 
> [name=connection-close-rate, 
> group=controller-channel-metrics,description=Connections closed per second in 
> the window., tags={broker-id=10}]' already exists, can't register another one.
> at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
> at 
> org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
> at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
> at 
> kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
> at 
> kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
> at 
> kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
> at 
> kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
> at 
> kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
> at 
> kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
> at 
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
> at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) 
> {code}
> We observe a tight loop of controller (re-)election, i.e. one node hits this 
> exception, and leadership transitions to the next, which then hits the 
> exception, ad infinitum.
> Producers and consumers appear to be connecting ok, and are able to produce 
> and consume messages.
> Relevant data points:
>  - prior to this cluster restart, a partition reassignment was attempted for 
> a number of topics, which appeared to get stuck in the "in progress" state 
> (on the order of days)
>  - these topics were subsequently deleted
>  - a rolling restart of the cluster was performed to turn on 
> broker-to-broker SSL communication
>  - the SSL change has subsequently been _rolled back_ after we observed these 
> exceptions
>  - the entire cluster was shut down, and nodes brought back one at a time in 
> an attempt to clear the exception. We were able to restart the cluster, but 
> we continue to see the exceptions
> We also observed, during the same time as the exception above, the following 
> exception on all hosts:
> {code}
> 2017-03-15 01:44:04,572 ERROR 
> [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener 
> on Controller 10]: Error while handling broker changes
> java.lang.ClassCastException: java.lang.String cannot be cast to 
> java.lang.Integer
>   

[jira] [Commented] (KAFKA-3741) KStream config for changelog min.in.sync.replicas

2017-06-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053728#comment-16053728
 ] 

ASF GitHub Bot commented on KAFKA-3741:
---

Github user dguy closed the pull request at:

https://github.com/apache/kafka/pull/3255


> KStream config for changelog min.in.sync.replicas
> -
>
> Key: KAFKA-3741
> URL: https://issues.apache.org/jira/browse/KAFKA-3741
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 0.10.0.0
>Reporter: Roger Hoover
>Assignee: Damian Guy
>  Labels: api
>
> Kafka Streams currently allows you to specify a replication factor for 
> changelog and repartition topics that it creates.  It should also allow you 
> to specify min.in.sync.replicas.
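
For context, the replication factor is already exposed through StreamsConfig, and the 
per-topic override asked for here was eventually addressed in later releases by a 
"topic." prefix on StreamsConfig that forwards broker-side topic configs to the 
internal topics. A sketch assuming such a release (this API is not available in the 
0.10.x line discussed here):

{code:java}
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class StreamsInternalTopicConfigSketch {
    public static Properties buildConfig() {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        // replication factor for internal changelog/repartition topics (already supported)
        props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
        // broker-side topic config forwarded to internal topics via the "topic." prefix,
        // which is the capability this ticket asks for (added in later releases)
        props.put(StreamsConfig.topicPrefix("min.insync.replicas"), "2");
        return props;
    }
}
{code}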



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

2017-06-19 Thread Pengwei (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053661#comment-16053661
 ] 

Pengwei commented on KAFKA-4900:


I found this issue seems to happen because, when the controller fails to send a 
message, it calls controllerElector.resign().
In controllerElector.resign(), it sets leaderId = -1 and tries to delete the zk 
node. After leaderId is set to -1, the zk client handles
LeaderChangeListener.handleDataDeleted; it thinks the previous controller is not 
itself (because leaderId is -1) and tries to elect again. The previous 
controller-to-broker send threads are then not closed, and starting the new 
controller-to-broker send threads hits this issue.
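
The collision itself is easy to reproduce with the client metrics library: if the first 
ControllerChannelManager (and its Selector) is never shut down, registering the same 
controller-channel metric a second time throws exactly the IllegalArgumentException 
from the description. A standalone sketch (metric name and tags copied from the error 
message; this is not controller code):

{code:java}
import java.util.Collections;

import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;

public class DuplicateMetricSketch {
    public static void main(final String[] args) {
        final Metrics metrics = new Metrics();
        final MetricName name = metrics.metricName(
                "connection-close-rate",
                "controller-channel-metrics",
                "Connections closed per second in the window.",
                Collections.singletonMap("broker-id", "10"));

        final Sensor first = metrics.sensor("connections-closed-1");
        first.add(name, new Rate());   // first registration succeeds

        final Sensor second = metrics.sensor("connections-closed-2");
        second.add(name, new Rate());  // throws IllegalArgumentException: "A metric named ... already exists"
    }
}
{code}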

> Brokers stuck in controller re-election loop after failing to register metrics
> --
>
> Key: KAFKA-4900
> URL: https://issues.apache.org/jira/browse/KAFKA-4900
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, core
>Affects Versions: 0.10.1.1
>Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception 
> continues to occur on all three nodes.
> {code}
> 2017-03-15 02:17:30,677 ERROR 
> [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] 
> server.ZookeeperLeaderElector - Error while electing or becoming leader on 
> broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName 
> [name=connection-close-rate, 
> group=controller-channel-metrics,description=Connections closed per second in 
> the window., tags={broker-id=10}]' already exists, can't register another one.
> at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
> at 
> org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
> at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
> at 
> kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at 
> kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
> at 
> kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
> at 
> kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
> at 
> kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
> at 
> kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
> at 
> kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
> at 
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
> at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) 
> {code}
> We observe a tight loop of controller (re-)election, i.e. one node hits this 
> exception, and leadership transitions to the next, which then hits the 
> exception, ad infinitum.
> Producers and consumers appear to be connecting ok, and are able to produce 
> and consume messages.
> Relevant data points:
>  - prior to this cluster restart, a partition reassignment was attempted for 
> a number of topics, which appeared to get stuck in the "in progress" state 
> (on the order of days)
>  - these topics were subsequently deleted
>  - a rolling restart of the cluster was performed to turn on 
> broker-to-broker SSL communication
>  - the SSL change has subsequently been _rolled back_ after we observed these 
> exceptions
>  - the entire cluster was shut down, and nodes brought back one at a time in 
> an attempt to clear the exception. We were able to restart the cluster, but 
> we continue to see the exceptions
> We also observed, d