[jira] [Commented] (KAFKA-2908) Another, possibly different, Gap in Consumption after Restart

2015-12-01 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035355#comment-15035355
 ] 

Jason Gustafson commented on KAFKA-2908:


An additional note from the test failure: The last message consumed from 
partition 1 occurred just before controller 3 transferred leadership of the 
partition to broker 1. After this, both the producer and consumer fail to do 
much of anything for about 20 seconds. No messages are produced and no messages 
are consumed. My best guess is that they are both blocked awaiting a metadata 
update. After the 20 seconds passes, controller 3 begins shutting down as part 
of the rolling reboot. This seems to unblock whatever the consumer and producer 
were waiting for. The producer then gets a metadata update (this 
appears in the logs) and is able to produce to all partitions. Likewise, the 
consumer begins fetching data again, but only for two of the three partitions. 
We don't have debug logs for the consumer, so it's tough to say more than that. 
However, the fact that both the consumer and producer were blocked for so long 
suggests a problem at NetworkClient or below. There may also be a problem in 
the consumer since it didn't recover completely as the producer apparently did.
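
Since we only have broker-side logs, one cheap way to catch this in a future run would be a small probe that logs per-partition consumer positions from the poll loop; a partition whose position stops advancing while the others keep moving is the gap candidate. This is only a diagnostic sketch, not part of the test harness; the broker address, group id, and topic name are placeholders.

{code}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PartitionProgressProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "progress-probe");             // placeholder group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("metadata.max.age.ms", "5000");             // refresh metadata more aggressively

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test_topic"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(1000);
                // Log position and fetch count per assigned partition; a stalled partition
                // shows a frozen position while the others keep advancing.
                for (TopicPartition tp : consumer.assignment()) {
                    System.out.printf("%s position=%d fetched=%d%n",
                            tp, consumer.position(tp), records.records(tp).size());
                }
            }
        }
    }
}
{code}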

> Another, possibly different, Gap in Consumption after Restart
> -
>
> Key: KAFKA-2908
> URL: https://issues.apache.org/jira/browse/KAFKA-2908
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
> Attachments: 2015-11-28--001.tar.gz
>
>
> *Context:*
> Instance of the rolling upgrade test. 10s sleeps have been put around node 
> restarts to ensure stability when subsequent nodes go down. Consumer timeout 
> has been set to 60s. Producer has been throttled to 100 messages per second. 
> Failure is rare: It occurred once in 60 executions (6 executions per run #276 
> -> #285 of the system_test_branch_builder)
> *Reported Failure:*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
> 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
> 16442, ...plus 1669 more
> *Immediate Observations:*
> * The messages not consumed are all in partition 1.
> * Production and consumption continue throughout the test (there is no 
> complete write or read failure as we have seen elsewhere)
> * The messages ARE present in the data files:
> e.g. 
> {quote}
> Examining missing value 16385. This was written to p1,offset:5453
> => there is an entry in all data files for partition 1 for this (presumably 
> meaning it was replicated correctly)
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker10/kafka-data-logs/test_topic-1/.log | grep 
> 'offset: 5453'
> worker10,9,8:
> offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
> => ID 16385 is definitely present in the Kafka logs suggesting a problem 
> consumer-side
> {quote}
> These entries definitely do not appear in the consumer stdout file either. 
> *Timeline:*
> {quote}
> 15:27:02,232 - Producer sends first message
> ..servers 1, 2 are restarted (clean shutdown)
> 15:29:42,718 - Server 3 shutdown complete
> 15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
> state transition (kafka.controller.KafkaController)
> 15:29:42,743 - New leader is 2 (LeaderChangeListener)
> 15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
> with correlation id 0 epoch 7 for partition (test_topic,1) since its 
> associated leader epoch 8 is old. Current leader epoch is 8 
> (state.change.logger)
> 15:29:45,642 - Producer starts writing messages that are never consumed
> 15:30:10,804 - Last message sent by producer
> 15:31:10,983 - Consumer times out after 60s wait without messages
> {quote}
> Logs for this run are attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-936) Kafka Metrics Memory Leak

2015-12-01 Thread Hechen Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035307#comment-15035307
 ] 

Hechen Gao commented on KAFKA-936:
--

I am using kafka_2.10-0.8.1.1.jar in my project (which depends on 
metrics-core-2.2.0.jar), and I also ran into this issue recently.
Metrics-related objects such as ConcurrentSkipListMap used up to 
90.5% of the memory.

> Kafka Metrics Memory Leak 
> --
>
> Key: KAFKA-936
> URL: https://issues.apache.org/jira/browse/KAFKA-936
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.8.0
> Environment: centos linux, jdk 1.6, jboss
>Reporter: Senthil Chittibabu
>Assignee: Neha Narkhede
>Priority: Critical
>
> I am using kafka_2.8.0-0.8.0-SNAPSHOT version. I am running into 
> OutOfMemoryError in PermGen Space. I have set the -XX:MaxPermSize=512m, but I 
> still get the same error. I used profiler to trace the memory leak, and found 
> the following kafka classes to be the cause for the memory leak. Please let 
> me know if you need any additional information to debug this issue. 
> kafka.server.FetcherLagMetrics
> kafka.consumer.FetchRequestAndResponseMetrics
> kafka.consumer.FetchRequestAndResponseStats
> kafka.metrics.KafkaTimer
> kafka.utils.Pool



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-656) Add Quotas to Kafka

2015-12-01 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke resolved KAFKA-656.
---
Resolution: Duplicate

> Add Quotas to Kafka
> ---
>
> Key: KAFKA-656
> URL: https://issues.apache.org/jira/browse/KAFKA-656
> Project: Kafka
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 0.8.1
>Reporter: Jay Kreps
>  Labels: project
>
> It would be nice to implement a quota system in Kafka to improve our support 
> for highly multi-tenant usage. The goal of this system would be to prevent 
> one naughty user from accidentally overloading the whole cluster.
> There are several quantities we would want to track:
> 1. Requests per second
> 2. Bytes written per second
> 3. Bytes read per second
> There are two reasonable groupings we would want to aggregate and enforce 
> these thresholds at:
> 1. Topic level
> 2. Client level (e.g. by client id from the request)
> When a request hits one of these limits we will simply reject it with a 
> QUOTA_EXCEEDED exception.
> To avoid suddenly breaking things without warning, we should ideally support 
> two thresholds: a soft threshold at which we produce some kind of warning and 
> a hard threshold at which we give the error. The soft threshold could just be 
> defined as 80% (or whatever) of the hard threshold.
> There are nuances to getting this right. If you measure second-by-second a 
> single burst may exceed the threshold, so we need a sustained measurement 
> over a period of time.
> Likewise when do we stop giving this error? To make this work right we likely 
> need to charge against the quota for request *attempts* not just successful 
> requests. Otherwise a client that is overloading the server will just flap on 
> and off--i.e. we would disable them for a period of time but when we 
> re-enabled them they would likely still be abusing us.
> It would be good to have a wiki design on how this would all work as a starting 
> point for discussion.
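
To make the soft/hard threshold and sustained-window ideas above concrete, here is a minimal sketch of a windowed rate check. It is illustrative only: the class and method names are hypothetical, and this is not the quota mechanism Kafka actually implements.

{code}
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: tracks a rate over a sliding window and reports whether the
// soft (warn) or hard (reject) threshold has been crossed.
public class WindowedRateQuota {
    enum Verdict { OK, SOFT_LIMIT, HARD_LIMIT }

    private final long windowMs;
    private final double hardLimitPerSec;
    private final double softLimitPerSec;                     // e.g. 80% of the hard limit
    private final Deque<long[]> samples = new ArrayDeque<>(); // {timestampMs, amount}
    private long windowTotal = 0;

    WindowedRateQuota(long windowMs, double hardLimitPerSec) {
        this.windowMs = windowMs;
        this.hardLimitPerSec = hardLimitPerSec;
        this.softLimitPerSec = 0.8 * hardLimitPerSec;
    }

    // Charge 'amount' (requests or bytes) against the quota, including failed attempts,
    // so an abusive client cannot flap between disabled and re-enabled.
    synchronized Verdict record(long amount, long nowMs) {
        samples.addLast(new long[]{nowMs, amount});
        windowTotal += amount;
        while (!samples.isEmpty() && samples.peekFirst()[0] < nowMs - windowMs) {
            windowTotal -= samples.removeFirst()[1];
        }
        double ratePerSec = windowTotal * 1000.0 / windowMs;
        if (ratePerSec > hardLimitPerSec) return Verdict.HARD_LIMIT; // reject with QUOTA_EXCEEDED
        if (ratePerSec > softLimitPerSec) return Verdict.SOFT_LIMIT; // emit a warning
        return Verdict.OK;
    }
}
{code}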



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request: modify config specification of topic level

2015-12-01 Thread EamonZhang
GitHub user EamonZhang opened a pull request:

https://github.com/apache/kafka/pull/612

modify config specification of topic level

For deleting topic-level config options, use --delete-config instead of --deleteConfig.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/EamonZhang/kafka trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/612.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #612


commit b72a92885a83f7991879393f920e8d1330dbe1e1
Author: eamon 
Date:   2015-12-02T02:54:18Z

modify config




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (KAFKA-2928) system tests: failures in version-related sanity checks

2015-12-01 Thread Geoff Anderson (JIRA)
Geoff Anderson created KAFKA-2928:
-

 Summary: system tests: failures in version-related sanity checks
 Key: KAFKA-2928
 URL: https://issues.apache.org/jira/browse/KAFKA-2928
 Project: Kafka
  Issue Type: Bug
Reporter: Geoff Anderson


There have been a few consecutive failures of version-related sanity checks in 
nightly system test runs:
kafkatest.sanity_checks.test_verifiable_producer
kafkatest.sanity_checks.test_kafka_version

assert is_version(...) is failing
utils.util.is_version is a fairly rough heuristic, so most likely this needs to 
be updated.

E.g., see
http://testing.confluent.io/kafka/2015-12-01--001/
(if this is broken, use 
http://testing.confluent.io/kafka/2015-12-01--001.tar.gz)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (KAFKA-2926) [MirrorMaker] InternalRebalancer calls wrong method of external rebalancer

2015-12-01 Thread Gwen Shapira (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gwen Shapira reassigned KAFKA-2926:
---

Assignee: Gwen Shapira

> [MirrorMaker] InternalRebalancer calls wrong method of external rebalancer
> --
>
> Key: KAFKA-2926
> URL: https://issues.apache.org/jira/browse/KAFKA-2926
> Project: Kafka
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.9.0.0
>Reporter: Gwen Shapira
>Assignee: Gwen Shapira
>
> MirrorMaker has an internal rebalance listener that will invoke an external 
> (pluggable) listener if such exists. Looks like the internal listener calls 
> the wrong method of the external listener.
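
For context, the delegation pattern in question looks roughly like the sketch below. This is only an illustration of the bug class using the public ConsumerRebalanceListener interface, not MirrorMaker's actual internal code, and the class name is made up.

{code}
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

// Illustration only: an internal listener wrapping a pluggable external listener.
// The bug class being described is forwarding to the *wrong* method, e.g. calling
// external.onPartitionsRevoked(...) from onPartitionsAssigned(...).
public class DelegatingRebalanceListener implements ConsumerRebalanceListener {
    private final ConsumerRebalanceListener external; // may be null if no plugin is configured

    public DelegatingRebalanceListener(ConsumerRebalanceListener external) {
        this.external = external;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // internal bookkeeping (e.g. flush and commit) would happen here
        if (external != null) {
            external.onPartitionsRevoked(partitions);  // must match the callback being handled
        }
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        if (external != null) {
            external.onPartitionsAssigned(partitions); // not onPartitionsRevoked
        }
    }
}
{code}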



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2927) System tests: reduce storage footprint of collected logs

2015-12-01 Thread Geoff Anderson (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geoff Anderson updated KAFKA-2927:
--
Summary: System tests: reduce storage footprint of collected logs  (was: 
System tests: reduce storage footprint of KafkaService)

> System tests: reduce storage footprint of collected logs
> 
>
> Key: KAFKA-2927
> URL: https://issues.apache.org/jira/browse/KAFKA-2927
> Project: Kafka
>  Issue Type: Bug
>Reporter: Geoff Anderson
>Assignee: Geoff Anderson
>
> Looking at recent nightly test runs (testing.confluent.io/kafka), the storage 
> requirements for log output from the various services have increased 
> significantly, up to 7-10G for a single test run, up from hundreds of MB.
> Current breakdown:
> 23M   Benchmark
> 3.2M  ClientCompatibilityTest
> 613M  ConnectDistributedTest
> 1.1M  ConnectRestApiTest
> 1.5M  ConnectStandaloneFileTest
> 2.0M  ConsoleConsumerTest
> 440K  KafkaVersionTest
> 744K  Log4jAppenderTest
> 49M   QuotaTest
> 3.0G  ReplicationTest
> 1.2G  TestMirrorMakerService
> 185M  TestUpgrade
> 372K  TestVerifiableProducer
> 2.3G  VerifiableConsumerTest
> The biggest contributors in these test suites:
> ReplicationTest:
> verifiable_producer.log (currently TRACE level)
> VerifiableConsumerTest:
> kafka server.log
> TestMirrorMakerService:
> verifiable_producer.log
> ConnectDistributedTest:
> kafka server.log
> The worst offenders are therefore verifiable_producer.log, which is logging at 
> TRACE level, and kafka server.log, which is logging at DEBUG level.
> One solution is to:
> 1) Update the log4j configs to log separately to an INFO-level file and a 
> DEBUG-level file, at least for the worst offenders.
> 2) Don't collect the DEBUG (and below) logs by default; only mark them for 
> collection when a test fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2927) System tests: reduce storage footprint of KafkaService

2015-12-01 Thread Geoff Anderson (JIRA)
Geoff Anderson created KAFKA-2927:
-

 Summary: System tests: reduce storage footprint of KafkaService
 Key: KAFKA-2927
 URL: https://issues.apache.org/jira/browse/KAFKA-2927
 Project: Kafka
  Issue Type: Bug
Reporter: Geoff Anderson
Assignee: Geoff Anderson


Looking at recent nightly test runs (testing.confluent.io/kafka), the storage 
requirements for log output from the various services have increased 
significantly, up to 7-10G for a single test run, up from hundreds of MB.

Current breakdown:
23M   Benchmark
3.2M  ClientCompatibilityTest
613M  ConnectDistributedTest
1.1M  ConnectRestApiTest
1.5M  ConnectStandaloneFileTest
2.0M  ConsoleConsumerTest
440K  KafkaVersionTest
744K  Log4jAppenderTest
49M   QuotaTest
3.0G  ReplicationTest
1.2G  TestMirrorMakerService
185M  TestUpgrade
372K  TestVerifiableProducer
2.3G  VerifiableConsumerTest


The biggest contributors in these test suites:

ReplicationTest:
verifiable_producer.log (currently TRACE level)

VerifiableConsumerTest:
kafka server.log

TestMirrorMakerService:
verifiable_producer.log

ConnectDistributedTest:
kafka server.log


The worst offenders are therefore verifiable_producer.log, which is logging at 
TRACE level, and kafka server.log, which is logging at DEBUG level.

One solution is to:
1) Update the log4j configs to log separately to an INFO-level file and a 
DEBUG-level file, at least for the worst offenders (a rough sketch of the 
two-threshold idea follows below).

2) Don't collect the DEBUG (and below) logs by default; only mark them for 
collection when a test fails.
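
For illustration only, a programmatic log4j equivalent of the two-threshold idea in 1) is sketched below; the real change would live in the system tests' log4j.properties, and the file names and pattern here are placeholders.

{code}
import java.io.IOException;
import org.apache.log4j.FileAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class TwoThresholdLoggingSketch {
    public static void main(String[] args) throws IOException {
        PatternLayout layout = new PatternLayout("[%d] %p %m (%c)%n");

        // File 1: INFO and above, collected after every run.
        FileAppender infoAppender = new FileAppender(layout, "service.info.log");
        infoAppender.setThreshold(Level.INFO);

        // File 2: full DEBUG output, only collected when a test fails.
        FileAppender debugAppender = new FileAppender(layout, "service.debug.log");
        debugAppender.setThreshold(Level.DEBUG);

        Logger root = Logger.getRootLogger();
        root.setLevel(Level.DEBUG);        // emit DEBUG; the appender thresholds split the files
        root.addAppender(infoAppender);
        root.addAppender(debugAppender);

        root.debug("written only to the debug file");
        root.info("written to both files");
    }
}
{code}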




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2926) [MirrorMaker] InternalRebalancer calls wrong method of external rebalancer

2015-12-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035135#comment-15035135
 ] 

ASF GitHub Bot commented on KAFKA-2926:
---

GitHub user gwenshap opened a pull request:

https://github.com/apache/kafka/pull/611

KAFKA-2926: [MirrorMaker] InternalRebalancer calls wrong method of ex…

…ternal rebalancer

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gwenshap/kafka KAFKA-2926

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/611.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #611


commit a5a4db994c0facd09716ec2c51f9c8ea868e2e87
Author: Gwen Shapira 
Date:   2015-12-02T02:13:04Z

KAFKA-2926: [MirrorMaker] InternalRebalancer calls wrong method of external 
rebalancer




> [MirrorMaker] InternalRebalancer calls wrong method of external rebalancer
> --
>
> Key: KAFKA-2926
> URL: https://issues.apache.org/jira/browse/KAFKA-2926
> Project: Kafka
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.9.0.0
>Reporter: Gwen Shapira
>
> MirrorMaker has an internal rebalance listener that will invoke an external 
> (pluggable) listener if such exists. Looks like the internal listener calls 
> the wrong method of the external listener.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request: KAFKA-2926: [MirrorMaker] InternalRebalancer c...

2015-12-01 Thread gwenshap
GitHub user gwenshap opened a pull request:

https://github.com/apache/kafka/pull/611

KAFKA-2926: [MirrorMaker] InternalRebalancer calls wrong method of ex…

…ternal rebalancer

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gwenshap/kafka KAFKA-2926

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/611.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #611


commit a5a4db994c0facd09716ec2c51f9c8ea868e2e87
Author: Gwen Shapira 
Date:   2015-12-02T02:13:04Z

KAFKA-2926: [MirrorMaker] InternalRebalancer calls wrong method of external 
rebalancer




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Updated] (KAFKA-2926) [MirrorMaker] InternalRebalancer calls wrong method of external rebalancer

2015-12-01 Thread Gwen Shapira (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gwen Shapira updated KAFKA-2926:

Description: MirrorMaker has an internal rebalance listener that will 
invoke an external (pluggable) listener if such exists. Looks like the internal 
listener calls the wrong method of the external listener.  (was: MirrorMaker 
has an internal rebalance listener that will invoke an external (pluggable) 
listener if such exists. )

> [MirrorMaker] InternalRebalancer calls wrong method of external rebalancer
> --
>
> Key: KAFKA-2926
> URL: https://issues.apache.org/jira/browse/KAFKA-2926
> Project: Kafka
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.9.0.0
>Reporter: Gwen Shapira
>
> MirrorMaker has an internal rebalance listener that will invoke an external 
> (pluggable) listener if such exists. Looks like the internal listener calls 
> the wrong method of the external listener.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2926) [MirrorMaker] InternalRebalancer calls wrong method of external rebalancer

2015-12-01 Thread Gwen Shapira (JIRA)
Gwen Shapira created KAFKA-2926:
---

 Summary: [MirrorMaker] InternalRebalancer calls wrong method of 
external rebalancer
 Key: KAFKA-2926
 URL: https://issues.apache.org/jira/browse/KAFKA-2926
 Project: Kafka
  Issue Type: Bug
  Components: tools
Affects Versions: 0.9.0.0
Reporter: Gwen Shapira


MirrorMaker has an internal rebalance listener that will invoke an external 
(pluggable) listener if such exists. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1851) OffsetFetchRequest returns extra partitions when input only contains unknown partitions

2015-12-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035112#comment-15035112
 ] 

ASF GitHub Bot commented on KAFKA-1851:
---

GitHub user apovzner opened a pull request:

https://github.com/apache/kafka/pull/610

KAFKA-1851 Using random file names for local kdc files to avoid conflicts.

I originally tried to solve the problem by using tempfile, creating and 
using an scp() utility method that created a random local temp file every time it 
was called. However, it required passing the miniKdc object to SecurityConfig's 
setup_node, which looked very invasive since many tests use this method. Here 
is the PR for that, which I think we will close: 
https://github.com/apache/kafka/pull/609

This is the least invasive change to solve conflicts between 
multiple test jobs. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apovzner/kafka kafka_2851_01

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/610.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #610


commit 4c9c76825b9dff5fb509eac01592d37f357b5775
Author: Anna Povzner 
Date:   2015-12-02T01:55:08Z

KAFKA-2851:  Using random file names for local kdc files to avoid conflicts




> OffsetFetchRequest returns extra partitions when input only contains unknown 
> partitions
> ---
>
> Key: KAFKA-1851
> URL: https://issues.apache.org/jira/browse/KAFKA-1851
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.2.0
>Reporter: Jun Rao
>Assignee: Jun Rao
>Priority: Blocker
> Fix For: 0.8.2.0
>
> Attachments: kafka-1851.patch
>
>
> When issuing an OffsetFetchRequest with an unknown topic partition, the 
> OffsetFetchResponse unexpectedly returns all partitions in the same consumer 
> group, in addition to the unknown partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request: KAFKA-1851 Using random file names for local k...

2015-12-01 Thread apovzner
GitHub user apovzner opened a pull request:

https://github.com/apache/kafka/pull/610

KAFKA-1851 Using random file names for local kdc files to avoid conflicts.

I originally tried to solve the problem by using tempfile, creating and 
using an scp() utility method that created a random local temp file every time it 
was called. However, it required passing the miniKdc object to SecurityConfig's 
setup_node, which looked very invasive since many tests use this method. Here 
is the PR for that, which I think we will close: 
https://github.com/apache/kafka/pull/609

This is the least invasive change to solve conflicts between 
multiple test jobs. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apovzner/kafka kafka_2851_01

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/610.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #610


commit 4c9c76825b9dff5fb509eac01592d37f357b5775
Author: Anna Povzner 
Date:   2015-12-02T01:55:08Z

KAFKA-2851:  Using random file names for local kdc files to avoid conflicts




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] KIP-30 Allow for brokers to have plug-able consensus and meta data storage sub systems

2015-12-01 Thread Jay Kreps
Hey Flavio,

Yeah, I think we are largely in agreement on virtually all points.

Where I saw ZK shine was really in in-house infrastructure. LinkedIn had a
dozen in-house systems that all used it, and it wouldn't have made sense
for any of those systems to build their own. Likewise when we started Kafka
there were really only 1-3 developers for a very long time, so doing anything
more custom would have been out of reach. I guess the characteristic of
in-house infrastructure is that it has to be cheap to build, and it often
ends up having lots and lots of other system dependencies which is fine so
long as they are things you already run.

For an open source product, though, you are kind of optimizing with a
different objective function. You are trying to make the thing easy to get
going with and willing to spend more time to get there. That out-of-the-box
experience of how easy it is to adopt and operationalize is the big issue
in how successful the system is. I think using external consensus systems
ends up not being quite as good here because many people won't already have
the dependency as part of their stack and for them you effectively double
the operational footprint they have to master. I think this is why this is
such a loud and persistent complaint (Joe is right, it is the first
question asked at every talk I give)--they want to adopt one distributed
thingy (Kafka) which is hard enough to monitor, configure, understand etc,
but to get that we make them learn a second one too. Even when people
already have zk in their stack that doesn't really help because it turns
out that sharing the zk cluster safely is usually not easy for people.

Actually ZK itself is really great in this way--I think a lot of its
success comes from being a totally simple standalone component. If it
required some other system to run (dunno what, but say HDFS or MySQL or
whatever) I think it would be far less successful.

Obviously a lot depends on how much is reused and what the conceptual
"footprint" of the implementation is. If you end up with something in Kafka
that is basically all standalone code (basically a ZK-like service but less
tested and in scala) then that would be totally stupid. I think that is the
outcome you are describing.

On the other hand it might be possible to reuse much more. I have not done
a design for something like this so all of this is totally half baked. But
imagine you implement Raft on Kafka (Kraft!). Basically you have a magical
kafka log "metadata" and appends, recovery, and leadership for this log are
not handled using the normal ISR logic but go through a special code path
that implements the raft algorithm. A subset of 3-5 brokers would be full
participants in maintaining this log and the rest of the cluster would just
passively consume it. All brokers have a callback plugin for processing
events appended to the log (this is what handles things like topic creates
and deletes), and all have a local rocksdb key-value store replicating the
contents of the log which is queryable. This k/v store holds the various
topic configs, metadata entries, etc. You would implement a new
RequestVotes RPC for doing leadership election. The leader of the Raft
election would also serve as the controller for the duration of its term.

I'm not claiming this is well thought out, it's obviously not. It also
might not be what you ended up with if you fully rethought the relationship
between the consensus log and algorithm used for that and the data logs and
algorithm used for them. Buuut, pretend for a moment this is possible and
the implementation was not buggy. I claim that this setup would be much
operationally simpler than operationalizing, securing, monitoring, and
configuring both Kafka and a separate zk/consul/etcd cluster. The impact is
some additional logic for processing the metadata appends, maybe some
changes to the log implementation, and a new RPC for leadership election.
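
Purely to make the shape of that concrete, a toy reading of the "every broker passively consumes the metadata log and applies events to a local queryable store" part might look like the sketch below. This is not a design: the "metadata" topic name, the handler interface, and the in-memory map standing in for RocksDB are all assumptions.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Toy sketch: a broker-side follower that replays a "metadata" log and applies each
// event to a local store via a callback plugin. A real version would use RocksDB and
// Raft-led appends; here a HashMap stands in and all names are made up.
public class MetadataLogFollower {
    interface MetadataEventHandler {
        void onEvent(String key, String value, Map<String, String> store);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "metadata-follower");
        props.put("auto.offset.reset", "earliest");           // replay the log from the beginning
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, String> store = new HashMap<>();          // stand-in for a local RocksDB
        MetadataEventHandler handler = (key, value, s) -> {
            if (value == null) s.remove(key); else s.put(key, value); // e.g. topic config upsert/delete
        };

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("metadata"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(500);
                for (ConsumerRecord<String, String> record : records) {
                    handler.onEvent(record.key(), record.value(), store);
                }
            }
        }
    }
}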

Thoughts?

-Jay

On Tue, Dec 1, 2015 at 2:25 PM, Flavio Junqueira  wrote:

> I'm new to this community and I don't necessarily have all the background
> around the complaints on ZooKeeper, but I'd like to give my opinion, even
> if biased. To introduce myself, in case folks here don't know me, I'm
> one of the guys who developed ZooKeeper originally and designed the Zab
> protocol.
>
> > On 01 Dec 2015, at 19:58, Jay Kreps  wrote:
> >
> > Hey Joe,
> >
> > Thanks for raising this. People really want to get rid of the ZK
> > dependency, I agree it is among the most asked for things. Let me give a
> > quick critique and a more radical plan.
> >
>
> Putting aside the cases in which people already use or want to use
> something else, what's the concern here exactly? People are always trying
> to get rid of dependencies but they give up when they start looking at the
> alternatives or the difficulty of doing it themselves. Also, during these past
> few months that I have been interacting with this community, I noticed at
> least one issue 

[jira] [Created] (KAFKA-2925) NullPointerException if FileStreamSinkTask is stopped before initialization finishes

2015-12-01 Thread Ewen Cheslack-Postava (JIRA)
Ewen Cheslack-Postava created KAFKA-2925:


 Summary: NullPointerException if FileStreamSinkTask is stopped 
before initialization finishes
 Key: KAFKA-2925
 URL: https://issues.apache.org/jira/browse/KAFKA-2925
 Project: Kafka
  Issue Type: Bug
  Components: copycat
Affects Versions: 0.9.0.0
Reporter: Ewen Cheslack-Postava
Assignee: Ewen Cheslack-Postava
Priority: Minor


If a FileStreamSinkTask is stopped too quickly after a distributed herder 
rebalances work, it can result in cleanup happening without start() ever being 
called:

{quote}
Sink task org.apache.kafka.connect.runtime.WorkerSinkTask@f9ac651 was stopped 
before completing join group. Task initialization and start is being skipped 
(org.apache.kafka.connect.runtime.WorkerSinkTask:150)
{quote}

This is actually a bit weird: stop() is still called so that resources 
allocated in the constructor can be cleaned up, but it is possibly unexpected that 
stop() will be called without start() ever having been called.

Because the code in FileStreamSinkTask's stop() method assumes start() has been 
called, it can throw a NullPointerException when the 
PrintStream has not been initialized.

The easy fix is to check for nulls before closing. However, we should probably 
also consider whether the current possible sequence of events is confusing and 
whether we should skip invoking stop() and make it clear in the SinkTask interface that 
you should only initialize things in the constructor that won't need any manual 
cleanup later.
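
The null-guard fix reads roughly like the simplified sketch below; this is not the actual FileStreamSinkTask source, and the field and class names are assumptions.

{code}
import java.io.PrintStream;
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Simplified sketch of the guard: stop() may run even though start() never did,
// so anything initialized in start() must be checked before cleanup.
public class GuardedFileSinkTask extends SinkTask {
    private PrintStream outputStream;        // only initialized in start()

    @Override
    public void start(Map<String, String> props) {
        outputStream = System.out;           // placeholder: the real task opens the configured file
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            outputStream.println(record.value());
        }
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> offsets) {
        if (outputStream != null) {
            outputStream.flush();
        }
    }

    @Override
    public void stop() {
        if (outputStream != null) {          // guard: start() may never have been called
            outputStream.close();
        }
    }

    @Override
    public String version() {
        return "sketch";
    }
}
{code}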



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-2851) system tests: error copying keytab file

2015-12-01 Thread Anna Povzner (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010231#comment-15010231
 ] 

Anna Povzner edited comment on KAFKA-2851 at 12/2/15 12:10 AM:
---

Pull request: https://github.com/apache/kafka/pull/609




was (Author: apovzner):
Pull request: https://github.com/apache/kafka/pull/518

> system tests: error copying keytab file
> ---
>
> Key: KAFKA-2851
> URL: https://issues.apache.org/jira/browse/KAFKA-2851
> Project: Kafka
>  Issue Type: Bug
>Reporter: Geoff Anderson
>Assignee: Anna Povzner
>Priority: Minor
>
> It is best to use unique paths for temporary files on the test driver machine 
> so that multiple test jobs don't conflict. 
> If the test driver machine is running multiple ducktape jobs concurrently, as 
> is the case with Confluent nightly test runs, conflicts can occur if the same 
> canonical path is always used.
> In this case, security_config.py copies a file to /tmp/keytab on the test 
> driver machine, while other jobs may remove this from the driver machine. 
> Then you can get errors like this:
> {code}
> 
> test_id:
> 2015-11-17--001.kafkatest.tests.replication_test.ReplicationTest.test_replication_with_broker_failure.security_protocol=SASL_PLAINTEXT.failure_mode=clean_bounce
> status: FAIL
> run time:   1 minute 33.395 seconds
> 
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.3.8-py2.7.egg/ducktape/tests/runner.py",
>  line 101, in run_all_tests
> result.data = self.run_single_test()
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.3.8-py2.7.egg/ducktape/tests/runner.py",
>  line 151, in run_single_test
> return self.current_test_context.function(self.current_test)
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.3.8-py2.7.egg/ducktape/mark/_mark.py",
>  line 331, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/tests/replication_test.py",
>  line 132, in test_replication_with_broker_failure
> self.run_produce_consume_validate(core_test_action=lambda: 
> failures[failure_mode](self))
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 66, in run_produce_consume_validate
> core_test_action()
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/tests/replication_test.py",
>  line 132, in 
> self.run_produce_consume_validate(core_test_action=lambda: 
> failures[failure_mode](self))
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/tests/replication_test.py",
>  line 43, in clean_bounce
> test.kafka.restart_node(prev_leader_node, clean_shutdown=True)
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/services/kafka/kafka.py",
>  line 275, in restart_node
> self.start_node(node)
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/services/kafka/kafka.py",
>  line 123, in start_node
> self.security_config.setup_node(node)
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/services/security/security_config.py",
>  line 130, in setup_node
> node.account.scp_to(MiniKdc.LOCAL_KEYTAB_FILE, SecurityConfig.KEYTAB_PATH)
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.3.8-py2.7.egg/ducktape/cluster/remoteaccount.py",
>  line 174, in scp_to
> return self._ssh_quiet(self.scp_to_command(src, dest, recursive))
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.3.8-py2.7.egg/ducktape/cluster/remoteaccount.py",
>  line 219, in _ssh_quiet
> raise e
> CalledProcessError: Command 'scp -o 'HostName 52.33.250.202' -o 'Port 22' -o 
> 'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o 
> 'PasswordAuthentication no' -o 'IdentityFile /var/lib/jenkins/muckrake.pem' 
> -o 'IdentitiesOnly yes' -o 'LogLevel FATAL'  /tmp/keytab 
> ubuntu@worker2:/mnt/security/keytab' returned non-zero exit status 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (KAFKA-2851) system tests: error copying keytab file

2015-12-01 Thread Anna Povzner (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anna Povzner updated KAFKA-2851:

Comment: was deleted

(was: In the same PR as KAFKA-2825)

> system tests: error copying keytab file
> ---
>
> Key: KAFKA-2851
> URL: https://issues.apache.org/jira/browse/KAFKA-2851
> Project: Kafka
>  Issue Type: Bug
>Reporter: Geoff Anderson
>Assignee: Anna Povzner
>Priority: Minor
>
> It is best to use unique paths for temporary files on the test driver machine 
> so that multiple test jobs don't conflict. 
> If the test driver machine is running multiple ducktape jobs concurrently, as 
> is the case with Confluent nightly test runs, conflicts can occur if the same 
> canonical path is always used.
> In this case, security_config.py copies a file to /tmp/keytab on the test 
> driver machine, while other jobs may remove this from the driver machine. 
> Then you can get errors like this:
> {code}
> 
> test_id:
> 2015-11-17--001.kafkatest.tests.replication_test.ReplicationTest.test_replication_with_broker_failure.security_protocol=SASL_PLAINTEXT.failure_mode=clean_bounce
> status: FAIL
> run time:   1 minute 33.395 seconds
> 
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.3.8-py2.7.egg/ducktape/tests/runner.py",
>  line 101, in run_all_tests
> result.data = self.run_single_test()
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.3.8-py2.7.egg/ducktape/tests/runner.py",
>  line 151, in run_single_test
> return self.current_test_context.function(self.current_test)
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.3.8-py2.7.egg/ducktape/mark/_mark.py",
>  line 331, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/tests/replication_test.py",
>  line 132, in test_replication_with_broker_failure
> self.run_produce_consume_validate(core_test_action=lambda: 
> failures[failure_mode](self))
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 66, in run_produce_consume_validate
> core_test_action()
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/tests/replication_test.py",
>  line 132, in 
> self.run_produce_consume_validate(core_test_action=lambda: 
> failures[failure_mode](self))
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/tests/replication_test.py",
>  line 43, in clean_bounce
> test.kafka.restart_node(prev_leader_node, clean_shutdown=True)
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/services/kafka/kafka.py",
>  line 275, in restart_node
> self.start_node(node)
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/services/kafka/kafka.py",
>  line 123, in start_node
> self.security_config.setup_node(node)
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/tests/kafkatest/services/security/security_config.py",
>  line 130, in setup_node
> node.account.scp_to(MiniKdc.LOCAL_KEYTAB_FILE, SecurityConfig.KEYTAB_PATH)
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.3.8-py2.7.egg/ducktape/cluster/remoteaccount.py",
>  line 174, in scp_to
> return self._ssh_quiet(self.scp_to_command(src, dest, recursive))
>   File 
> "/var/lib/jenkins/workspace/kafka_system_tests/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.3.8-py2.7.egg/ducktape/cluster/remoteaccount.py",
>  line 219, in _ssh_quiet
> raise e
> CalledProcessError: Command 'scp -o 'HostName 52.33.250.202' -o 'Port 22' -o 
> 'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o 
> 'PasswordAuthentication no' -o 'IdentityFile /var/lib/jenkins/muckrake.pem' 
> -o 'IdentitiesOnly yes' -o 'LogLevel FATAL'  /tmp/keytab 
> ubuntu@worker2:/mnt/security/keytab' returned non-zero exit status 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1851) OffsetFetchRequest returns extra partitions when input only contains unknown partitions

2015-12-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034929#comment-15034929
 ] 

ASF GitHub Bot commented on KAFKA-1851:
---

GitHub user apovzner opened a pull request:

https://github.com/apache/kafka/pull/609

KAFKA-1851 Using random dir under /temp for local kdc files to avoid 
conflicts.

when multiple test jobs are running.

I manually separated changes for KAFKA-2851 from this PR:  
https://github.com/apache/kafka/pull/570 which also had KAFKA-2825 changes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apovzner/kafka kafka-2851

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/609.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #609


commit d19a8533243a77f026a4f547bad40fd10dd68745
Author: Anna Povzner 
Date:   2015-12-02T00:02:13Z

KAFKA-1851 Using random dir under /temp for local kdc files to avoid 
conflicts when multiple test jobs are running.




> OffsetFetchRequest returns extra partitions when input only contains unknown 
> partitions
> ---
>
> Key: KAFKA-1851
> URL: https://issues.apache.org/jira/browse/KAFKA-1851
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.2.0
>Reporter: Jun Rao
>Assignee: Jun Rao
>Priority: Blocker
> Fix For: 0.8.2.0
>
> Attachments: kafka-1851.patch
>
>
> When issuing an OffsetFetchRequest with an unknown topic partition, the 
> OffsetFetchResponse unexpectedly returns all partitions in the same consumer 
> group, in addition to the unknown partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request: KAFKA-1851 Using random dir under /temp for lo...

2015-12-01 Thread apovzner
GitHub user apovzner opened a pull request:

https://github.com/apache/kafka/pull/609

KAFKA-1851 Using random dir under /temp for local kdc files to avoid 
conflicts.

when multiple test jobs are running.

I manually separated changes for KAFKA-2851 from this PR:  
https://github.com/apache/kafka/pull/570 which also had KAFKA-2825 changes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apovzner/kafka kafka-2851

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/609.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #609


commit d19a8533243a77f026a4f547bad40fd10dd68745
Author: Anna Povzner 
Date:   2015-12-02T00:02:13Z

KAFKA-1851 Using random dir under /temp for local kdc files to avoid 
conflicts when multiple test jobs are running.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Comment Edited] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034892#comment-15034892
 ] 

Rajini Sivaram edited comment on KAFKA-2891 at 12/1/15 11:53 PM:
-

[~benstopford]  The logs from my failing test runs all show the same pattern - 
min.insync.replicas set to one and messages acked when leader is the only ISR. 
When the leader gets killed by the test, messages are lost, as you would 
expect. The test was intended to run with min.insync.replicas set to 2, but due 
to a bug in the way min.insync.replicas was being set for topics, it was being 
left at the default of one. All tests which currently set min.insync.replicas have 
copied the same config with the result that the config is never set. I have 
updated the PR for KAFKA-2642 with a fix for the min.insync.replicas setting in 
all the tests which set this. Have scheduled a build with the fix and will 
check the results in the morning.


was (Author: rsivaram):
[~benstopford]  The logs from my failing test runs all show the same pattern - 
ISR set to 1 and messages acked when leader is the only ISR. When the leader 
gets killed by the test, messages are lost, as you would expect. The test was 
intended to run with min.insync.replicas set to 2, but due to a bug in the way 
min.insync.replicas was being set for topics, it was being left as default of 
one. All tests which currently set min.insync.replicas have copied the same 
config with the result that the config is never set. I have updated the PR for 
KAFKA-2642 with a fix for the min.insync.replicas setting in all the tests 
which set this. Have scheduled a build with the fix and will check the results 
in the morning.

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of missing messages is quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034892#comment-15034892
 ] 

Rajini Sivaram commented on KAFKA-2891:
---

[~benstopford]  The logs from my failing test runs all show the same pattern - 
ISR set to 1 and messages acked when leader is the only ISR. When the leader 
gets killed by the test, messages are lost, as you would expect. The test was 
intended to run with min.insync.replicas set to 2, but due to a bug in the way 
min.insync.replicas was being set for topics, it was being left at the default of 
one. All tests which currently set min.insync.replicas have copied the same 
config with the result that the config is never set. I have updated the PR for 
KAFKA-2642 with a fix for the min.insync.replicas setting in all the tests 
which set this. Have scheduled a build with the fix and will check the results 
in the morning.
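
For reference, the setting under discussion pairs a topic-level min.insync.replicas=2 with acks=all on the producer, so an ack implies at least two replicas hold the message and killing the leader alone cannot lose acked data. A minimal sketch follows; the broker address and topic name are placeholders.

{code}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AckedProducerSketch {
    public static void main(String[] args) throws Exception {
        // The topic itself must carry min.insync.replicas=2 (e.g. set via the topic-level
        // --config option when creating or altering the topic).
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("acks", "all");                           // wait for the full ISR, not just the leader
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Block until the broker acks; with min.insync.replicas=2 this ack means
            // at least two replicas have the record.
            producer.send(new ProducerRecord<>("test_topic", "key", "value")).get();
        }
    }
}
{code}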

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of missing messages is quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2908) Another, possibly different, Gap in Consumption after Restart

2015-12-01 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034878#comment-15034878
 ] 

Jason Gustafson commented on KAFKA-2908:


After inspecting the broker data logs, I found that there are two kinds of gaps 
in the consumer's output. The first are gaps from messages which were not 
successfully written to the server. All of the remaining gaps occur at the tail 
of partition 1 of the test topic. This suggests that the consumer stopped 
sending fetches for that partition for a prolonged period of time, which extended 
to the end of the test. This could happen either because the current connection 
did not close cleanly (which would cause a delay as long as the request 
timeout), or because the consumer was unable to find the leader for the partition. 
I'm trying to get as much information as I can out of the data here, but we may need 
to reproduce the error with more logging enabled to narrow the problem down further. 

The good news is that I have not found any cases where the consumer actually 
skipped over messages. 


> Another, possibly different, Gap in Consumption after Restart
> -
>
> Key: KAFKA-2908
> URL: https://issues.apache.org/jira/browse/KAFKA-2908
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
> Attachments: 2015-11-28--001.tar.gz
>
>
> *Context:*
> Instance of the rolling upgrade test. 10s sleeps have been put around node 
> restarts to ensure stability when subsequent nodes go down. Consumer timeout 
> has been set to 60s. Producer has been throttled to 100 messages per second. 
> Failure is rare: It occurred once in 60 executions (6 executions per run #276 
> -> #285 of the system_test_branch_builder)
> *Reported Failure:*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
> 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
> 16442, ...plus 1669 more
> *Immediate Observations:*
> * The messages not consumed are all in partition 1.
> * Production and consumption continue throughout the test (there is no 
> complete write or read failure as we have seen elsewhere)
> * The messages ARE present in the data files:
> e.g. 
> {quote}
> Examining missing value 16385. This was written to p1,offset:5453
> => there is an entry in all data files for partition 1 for this (presumably 
> meaning it was replicated correctly)
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker10/kafka-data-logs/test_topic-1/.log | grep 
> 'offset: 5453'
> worker10,9,8:
> offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
> => ID 16385 is definitely present in the Kafka logs suggesting a problem 
> consumer-side
> {quote}
> These entries definitely do not appear in the consumer stdout file either. 
> *Timeline:*
> {quote}
> 15:27:02,232 - Producer sends first message
> ..servers 1, 2 are restarted (clean shutdown)
> 15:29:42,718 - Server 3 shutdown complete
> 15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
> state transition (kafka.controller.KafkaController)
> 15:29:42,743 - New leader is 2 (LeaderChangeListener)
> 15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
> with correlation id 0 epoch 7 for partition (test_topic,1) since its 
> associated leader epoch 8 is old. Current leader epoch is 8 
> (state.change.logger)
> 15:29:45,642 - Producer starts writing messages that are never consumed
> 15:30:10,804 - Last message sent by producer
> 15:31:10,983 - Consumer times out after 60s wait without messages
> {quote}
> Logs for this run are attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2924) Add offsets/group metadata decoder so that DumpLogSegments can be used with the offsets topic

2015-12-01 Thread Jason Gustafson (JIRA)
Jason Gustafson created KAFKA-2924:
--

 Summary: Add offsets/group metadata decoder so that 
DumpLogSegments can be used with the offsets topic
 Key: KAFKA-2924
 URL: https://issues.apache.org/jira/browse/KAFKA-2924
 Project: Kafka
  Issue Type: Improvement
Reporter: Jason Gustafson
Assignee: Jason Gustafson


We've only implemented a MessageFormatter for use with the ConsoleConsumer, but 
it would be helpful to be able to pull offsets/metadata from log files directly 
in testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-30 Allow for brokers to have plug-able consensus and meta data storage sub systems

2015-12-01 Thread Flavio Junqueira
I'm new to this community and I don't necessarily have all the background 
around the complaints on ZooKeeper, but I'd like to give my opinion, even if 
biased. To introduce myself, in case folks here don't know me, I'm one of 
the guys who developed ZooKeeper originally and designed the Zab protocol.

> On 01 Dec 2015, at 19:58, Jay Kreps  wrote:
> 
> Hey Joe,
> 
> Thanks for raising this. People really want to get rid of the ZK
> dependency, I agree it is among the most asked for things. Let me give a
> quick critique and a more radical plan.
> 

Putting aside the cases in which people already use or want to use something 
else, what's the concern here exactly? People are always trying to get rid of 
dependencies but they give up when they start looking at the alternatives or 
the difficulty of doing it themselves. Also, during these past few months that I 
have been interacting with this community, I noticed at least one issue that 
had been around for a long time, was causing some pain, and was due to a poor 
understanding of the ZK semantics. I'm not convinced that doing your own will 
make it go away, unless you have a group of people dedicated to doing it, but 
then you're doing ZK, etcd, Consul all over again, with the difference that it 
is in your backyard.

> I don't think making ZK pluggable is the right thing to do. I have a lot of
> experience with this dynamic of introducing plugins for core functionality
> because I previously worked on a key-value store called Voldemort in which
> we made both the protocol and storage engine totally pluggable. I
> originally felt this was a good thing both philosophically and practically,
> but in retrospect came to believe it was a huge mistake--what people really
> wanted was one really excellent implementation with the kind of insane
> levels of in-production usage and test coverage that infrastructure
> demands. Pluggability is actually really at odds with this, and the ability
> to actually abstract over some really meaty dependency like a storage
> engine never quite works.

Adding another layer of abstraction is likely to cause even more pain. Right 
now there is ZkUtils -> ZkClient -> ZooKeeper client. 

> 
> People dislike the ZK dependency because it effectively doubles the
> operational load of Kafka--it doubles the amount of configuration,
> monitoring, and understanding needed.

I agree, but it isn't necessarily a problem with the dependency itself so much as 
how people think about the dependency. Taking the recent security work as an example, SASL 
was done separately for brokers and zk. It would make configuration easier if 
we just said SASL and under the hood configured what is necessary. Let me 
stress that I'm not complaining about the security work, and it is even 
possible that they were separated for different reasons, like a zk ensemble 
being shared, but I feel this is an example in which the developers in this 
community tend to think of it separately. Actually, one interesting point is 
that some of the security apparently was borrowed from zk. Not a problem, it is 
an open-source project after all, but there could be better synergy here. 

> Replacing ZK with a similar system
> won't fix this problem though--all the other consensus services are equally
> complex (and often less mature)--and it will cause two new problems. First
> there will be a layer of indirection that will make reasoning and improving
> the ZK implementation harder. For example, note that your plug-in api
> doesn't seem to cover multi-get and multi-write, when we added that we
> would end up breaking all plugins. Each new thing will be like that. Ops
> tools, config, documentation, etc will no longer be able to include any
> coverage of ZK because we can't assume ZK so all that becomes much harder.
> The second problem is that this introduces a combinatorial testing problem.
> People say they want to swap out ZK but they are assuming whatever they
> swap in will work equally well. How will we know that is true? The only way
> to explode out the testing to run with every possible plugin.

That's possibly less of a problem if you have a single instance that the 
community supports. Others can implement their own option, but the community 
doesn't necessarily have to maintain them or make sure they all work prior to 
releases.

> 
> If you want to see this in action take a look at ActiveMQ. ActiveMQ is less
> a system than a family of co-operating plugins and a configuration language
> for assembling them. Software engineers and open source communities are
> really prone to this kind of thing because "we can just make it pluggable"
> ends any argument. But the actual implementation is a mess, and later
> improvements in their threading, I/O, and other core models simply couldn't
> be made across all the plugins.
> 
> This blog post on configurability in UI is a really good summary of a
> similar dynamic:
> http://ometer.com/free-software-ui.html
> 
> Anyhow, not to go too far off on a r

Re: [DISCUSS] KIP-30 Allow for brokers to have plug-able consensus and meta data storage sub systems

2015-12-01 Thread Jay Kreps
Hey Joe,

I agree that you would need some transition plan, but think we may be
proposing two different things.

I think I'm proposing something like this:
1. Make tactical improvements to ZK usage as Neha describes
2. Scope out the longer term design for how we want to handle this
3. Build some kind of bridging functionality to get people from (1) to (2)

But the long-term approach is that (2) is the way it works: the abstraction
that subsystem provides would be what the code uses, etc.
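
Purely as a hypothetical illustration of what such an internal, non-user-facing 
abstraction might look like (these interfaces are invented for the sketch; they 
are not taken from KIP-30 or from anything proposed in this thread):

{code}
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical internal metadata-store abstraction: broker code would program
 * against this, while the single supported implementation (ZK today, possibly
 * a built-in quorum later) lives behind it. Illustrative only.
 */
interface MetadataStore extends AutoCloseable {

    /** Atomically apply a batch of writes (the multi-write case). */
    CompletableFuture<Void> multiWrite(List<MetadataOp> ops);

    /** Read several keys in one round trip (the multi-get case). */
    CompletableFuture<Map<String, byte[]>> multiGet(List<String> keys);

    /** Register a listener for changes under a key prefix; the store is
     *  responsible for not losing notifications across reconnects. */
    void watch(String prefix, MetadataListener listener);
}

/** A single write in a batch. Illustrative only. */
final class MetadataOp {
    final String key;
    final byte[] value;        // null means delete
    final int expectedVersion; // -1 means "any version"

    MetadataOp(String key, byte[] value, int expectedVersion) {
        this.key = key;
        this.value = value;
        this.expectedVersion = expectedVersion;
    }
}

interface MetadataListener {
    void onChange(String key, byte[] newValue);
}
{code}

The point is only that the broker would program against one interface with a 
single supported implementation behind it, not a user-swappable plugin.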

I'm basically not in favor of making this area a user-facing plugin because
I think people don't really want to swap in other systems; they just want
the zk dependency to go away. I think real testing and design require
being able to depend on the performance characteristics, errors, and very,
very subtle details of the implementation, and I'm just skeptical that that
will really ever become truly pluggable. And as I mentioned, my past
experience with this type of deep plug-in has been uniformly negative.

-Jay

On Tue, Dec 1, 2015 at 12:25 PM, Joe Stein  wrote:

> Yeah, lets do both! :) I always had trepidations about leaving things as is
> with ZooKeeper there. Can we have this new internal system be what replaces
> that but still make it modular somewhat.
>
> The problem with any new system is that everyone already trusts and relies
> on the existing scars we know heal. That is why we all are still using
> ZooKeeper ( I bet at least 3 clusters are still on 3.3.4 and one maybe
> 3.3.1 or something nutty ).
>
> etcd
> consul
> c*
> riak
> akka
>
> All have viable solutions and i have no idea what will be best or worst or
> even work but lots of folks are working on it now trying to get things to
> be different and work right for them.
>
> I think a native version should be there in the project and I am 100% on
> board with that native version NOT be ZooKeeper but homegrown.
>
> I also think the native default should use the KIP-30 interface so other
> server can also connect the feature they are solving also (that way
> deployments that have already adopted XYZ for consensus can use it).
>
> ~ Joe Stein
> - - - - - - - - - - - - - - - - - - -
>  [image: Logo-Black.jpg]
>   http://www.elodina.net
> http://www.stealth.ly
> - - - - - - - - - - - - - - - - - - -
>
> On Tue, Dec 1, 2015 at 2:58 PM, Jay Kreps  wrote:
>
> > Hey Joe,
> >
> > Thanks for raising this. People really want to get rid of the ZK
> > dependency, I agree it is among the most asked for things. Let me give a
> > quick critique and a more radical plan.
> >
> > I don't think making ZK pluggable is the right thing to do. I have a lot
> of
> > experience with this dynamic of introducing plugins for core
> functionality
> > because I previously worked on a key-value store called Voldemort in
> which
> > we made both the protocol and storage engine totally pluggable. I
> > originally felt this was a good thing both philosophically and
> practically,
> > but in retrospect came to believe it was a huge mistake--what people
> really
> > wanted was one really excellent implementation with the kind of insane
> > levels of in-production usage and test coverage that infrastructure
> > demands. Pluggability is actually really at odds with this, and the
> ability
> > to actually abstract over some really meaty dependency like a storage
> > engine never quite works.
> >
> > People dislike the ZK dependency because it effectively doubles the
> > operational load of Kafka--it doubles the amount of configuration,
> > monitoring, and understanding needed. Replacing ZK with a similar system
> > won't fix this problem though--all the other consensus services are
> equally
> > complex (and often less mature)--and it will cause two new problems.
> First
> > there will be a layer of indirection that will make reasoning and
> improving
> > the ZK implementation harder. For example, note that your plug-in api
> > doesn't seem to cover multi-get and multi-write, when we added that we
> > would end up breaking all plugins. Each new thing will be like that. Ops
> > tools, config, documentation, etc will no longer be able to include any
> > coverage of ZK because we can't assume ZK so all that becomes much
> harder.
> > The second problem is that this introduces a combinatorial testing
> problem.
> > People say they want to swap out ZK but they are assuming whatever they
> > swap in will work equally well. How will we know that is true? The only
> way
> > to explode out the testing to run with every possible plugin.
> >
> > If you want to see this in action take a look at ActiveMQ. ActiveMQ is
> less
> > a system than a family of co-operating plugins and a configuration
> language
> > for assembling them. Software engineers and open source communities are
> > really prone to this kind of thing because "we can just make it
> pluggable"
> > ends any argument. But the actual implementation is a mess, and later
> > improvements in their threading, I/O, and other core models simply
> 

Jenkins build is back to normal : kafka-trunk-jdk7 #868

2015-12-01 Thread Apache Jenkins Server
See 



Re: [DISCUSS] KIP-30 Allow for brokers to have plug-able consensus and meta data storage sub systems

2015-12-01 Thread Onur Karaman
+1 on what Neha said.

On Tue, Dec 1, 2015 at 1:41 PM, Neha Narkhede  wrote:

> I share Jay's concerns around plugins and he explained that very well, so I
> will avoid repeating those.
>
> Agree that there is some feedback about getting rid of ZooKeeper altogether
> and I'm on board with building a native version whenever we are ready to
> take this on. But I don't think the pain is big enough to solve this by
> providing a short-term solution via plugins.
>
> With respect to ZooKeeper, a more significant issue is performance and
> correct use of ZooKeeper APIs. First, switching to using the bulk read and
> write APIs, that ZooKeeper has released a while ago, will make a lot of
> things around failover better and faster. Second, there is a value in not
> wrapping the core ZooKeeper APIs in third-party plugins (whether it is
> ZkClient or Curator) since they try to mask functionality in ways that end
> up making it very tricky to write correct code. Case in point:
> https://issues.apache.org/jira/browse/KAFKA-1387. I suspect there are
> places in our code that still don't use the ZK watcher functionality
> correctly with the impact being losing important notifications and not
> acting on certain state changes at all. This is because we depend on
> ZkClient and it ends up hiding some details of the ZooKeeper API that
> shouldn't be hidden to handle such cases correctly.
>
> My concern is that this is a problem we will have with every pluggable
> implementation. Today, a lot of our code depends on the sort of guarantees
> that ZooKeeper provides around watches and ordering. I don't know the other
> systems well enough to say whether they would be able to provide similar
> guarantees around all the operations we'd want to support.
>
> A better use of effort is to focus on fixing our use of ZooKeeper until we
> can come back and replace it with the native implementation.
>
> On Tue, Dec 1, 2015 at 12:25 PM, Joe Stein  wrote:
>
> > Yeah, lets do both! :) I always had trepidations about leaving things as
> is
> > with ZooKeeper there. Can we have this new internal system be what
> replaces
> > that but still make it modular somewhat.
> >
> > The problem with any new system is that everyone already trusts and
> relies
> > on the existing scars we know heal. That is why we all are still using
> > ZooKeeper ( I bet at least 3 clusters are still on 3.3.4 and one maybe
> > 3.3.1 or something nutty ).
> >
> > etcd
> > consul
> > c*
> > riak
> > akka
> >
> > All have viable solutions and i have no idea what will be best or worst
> or
> > even work but lots of folks are working on it now trying to get things to
> > be different and work right for them.
> >
> > I think a native version should be there in the project and I am 100% on
> > board with that native version NOT be ZooKeeper but homegrown.
> >
> > I also think the native default should use the KIP-30 interface so other
> > server can also connect the feature they are solving also (that way
> > deployments that have already adopted XYZ for consensus can use it).
> >
> > ~ Joe Stein
> > - - - - - - - - - - - - - - - - - - -
> >  [image: Logo-Black.jpg]
> >   http://www.elodina.net
> > http://www.stealth.ly
> > - - - - - - - - - - - - - - - - - - -
> >
> > On Tue, Dec 1, 2015 at 2:58 PM, Jay Kreps  wrote:
> >
> > > Hey Joe,
> > >
> > > Thanks for raising this. People really want to get rid of the ZK
> > > dependency, I agree it is among the most asked for things. Let me give
> a
> > > quick critique and a more radical plan.
> > >
> > > I don't think making ZK pluggable is the right thing to do. I have a
> lot
> > of
> > > experience with this dynamic of introducing plugins for core
> > functionality
> > > because I previously worked on a key-value store called Voldemort in
> > which
> > > we made both the protocol and storage engine totally pluggable. I
> > > originally felt this was a good thing both philosophically and
> > practically,
> > > but in retrospect came to believe it was a huge mistake--what people
> > really
> > > wanted was one really excellent implementation with the kind of insane
> > > levels of in-production usage and test coverage that infrastructure
> > > demands. Pluggability is actually really at odds with this, and the
> > ability
> > > to actually abstract over some really meaty dependency like a storage
> > > engine never quite works.
> > >
> > > People dislike the ZK dependency because it effectively doubles the
> > > operational load of Kafka--it doubles the amount of configuration,
> > > monitoring, and understanding needed. Replacing ZK with a similar
> system
> > > won't fix this problem though--all the other consensus services are
> > equally
> > > complex (and often less mature)--and it will cause two new problems.
> > First
> > > there will be a layer of indirection that will make reasoning and
> > improving
> > > the ZK implementation harder. For example, note that your plug-in api
> > > doesn't seem to cover multi-get and

Build failed in Jenkins: kafka-trunk-jdk8 #197

2015-12-01 Thread Apache Jenkins Server
See 

Changes:

[wangguoz] MINOR: Update LinkedIn JVM tuning settings

[wangguoz] KAFKA-2915: Fix problem with System Tests that use bootstrap.servers

[wangguoz] MINOR - fix typo in index corruption warning message

[wangguoz] MINOR: Fix building from subproject directory

[wangguoz] KAFKA-2421: Upgrade LZ4 to version 1.3

--
[...truncated 4506 lines...]
org.apache.kafka.connect.json.JsonConverterTest > timeToJson PASSED

org.apache.kafka.connect.json.JsonConverterTest > structToJson PASSED

org.apache.kafka.connect.json.JsonConverterTest > 
testConnectSchemaMetadataTranslation PASSED

org.apache.kafka.connect.json.JsonConverterTest > shortToJson PASSED

org.apache.kafka.connect.json.JsonConverterTest > dateToConnect PASSED

org.apache.kafka.connect.json.JsonConverterTest > doubleToJson PASSED

org.apache.kafka.connect.json.JsonConverterTest > timeToConnect PASSED

org.apache.kafka.connect.json.JsonConverterTest > mapToConnectStringKeys PASSED

org.apache.kafka.connect.json.JsonConverterTest > floatToConnect PASSED

org.apache.kafka.connect.json.JsonConverterTest > decimalToJson PASSED

org.apache.kafka.connect.json.JsonConverterTest > arrayToConnect PASSED

org.apache.kafka.connect.json.JsonConverterTest > 
testCacheSchemaToConnectConversion PASSED

org.apache.kafka.connect.json.JsonConverterTest > booleanToJson PASSED

org.apache.kafka.connect.json.JsonConverterTest > bytesToConnect PASSED

org.apache.kafka.connect.json.JsonConverterTest > doubleToConnect PASSED
:connect:runtime:checkstyleMain
:connect:runtime:compileTestJavawarning: [options] bootstrap class path not set 
in conjunction with -source 1.7
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
1 warning

:connect:runtime:processTestResources
:connect:runtime:testClasses
:connect:runtime:checkstyleTest
:connect:runtime:test

org.apache.kafka.connect.util.KafkaBasedLogTest > testStartStop PASSED

org.apache.kafka.connect.util.KafkaBasedLogTest > testSendAndReadToEnd FAILED
org.junit.ComparisonFailure: expected: but was:
at org.junit.Assert.assertEquals(Assert.java:115)
at org.junit.Assert.assertEquals(Assert.java:144)
at 
org.apache.kafka.connect.util.KafkaBasedLogTest.testSendAndReadToEnd(KafkaBasedLogTest.java:312)

org.apache.kafka.connect.util.KafkaBasedLogTest > testReloadOnStart PASSED

org.apache.kafka.connect.util.KafkaBasedLogTest > testConsumerError PASSED

org.apache.kafka.connect.util.KafkaBasedLogTest > testProducerError PASSED

org.apache.kafka.connect.util.ShutdownableThreadTest > testGracefulShutdown 
PASSED

org.apache.kafka.connect.util.ShutdownableThreadTest > testForcibleShutdown 
PASSED

org.apache.kafka.connect.storage.KafkaOffsetBackingStoreTest > testStartStop 
PASSED

org.apache.kafka.connect.storage.KafkaOffsetBackingStoreTest > 
testReloadOnStart PASSED

org.apache.kafka.connect.storage.KafkaOffsetBackingStoreTest > testMissingTopic 
PASSED

org.apache.kafka.connect.storage.KafkaOffsetBackingStoreTest > testSetFailure 
PASSED

org.apache.kafka.connect.storage.KafkaOffsetBackingStoreTest > testGetSet PASSED

org.apache.kafka.connect.storage.OffsetStorageWriterTest > testWriteFlush PASSED

org.apache.kafka.connect.storage.OffsetStorageWriterTest > 
testWriteNullValueFlush PASSED

org.apache.kafka.connect.storage.OffsetStorageWriterTest > 
testWriteNullKeyFlush PASSED

org.apache.kafka.connect.storage.OffsetStorageWriterTest > testNoOffsetsToFlush 
PASSED

org.apache.kafka.connect.storage.OffsetStorageWriterTest > 
testFlushFailureReplacesOffsets PASSED

org.apache.kafka.connect.storage.OffsetStorageWriterTest > testAlreadyFlushing 
PASSED

org.apache.kafka.connect.storage.OffsetStorageWriterTest > 
testCancelBeforeAwaitFlush PASSED

org.apache.kafka.connect.storage.OffsetStorageWriterTest > 
testCancelAfterAwaitFlush PASSED

org.apache.kafka.connect.storage.FileOffsetBackingStoreTest > testSaveRestore 
PASSED

org.apache.kafka.connect.storage.FileOffsetBackingStoreTest > testGetSet PASSED

org.apache.kafka.connect.storage.KafkaConfigStorageTest > testStartStop PASSED

org.apache.kafka.connect.storage.KafkaConfigStorageTest > 
testPutTaskConfigsDoesNotResolveAllInconsistencies PASSED

org.apache.kafka.connect.storage.KafkaConfigStorageTest > 
testPutConnectorConfig PASSED

org.apache.kafka.connect.storage.KafkaConfigStorageTest > testPutTaskConfigs 
PASSED

org.apache.kafka.connect.storage.KafkaConfigStorageTest > testRestore PASSED

org.apache.kafka.connect.runtime.WorkerTest > testStopInvalidConnector PASSED

org.apache.kafka.connect.runtime.WorkerTest > testAddRemoveConnector PASSED

org.apache.kafka.connect.runtime.WorkerTest > testCleanupTasksOnStop PASSED

org.apache.kafka.connect.runtime.WorkerTest > testReconfigureConnectorTasks 
PASSED

org.apache.kafka.connect.runtime.WorkerTest > testAddRemoveTask PASSED

org.apache.kafka.conn

Re: [DISCUSS] KIP-30 Allow for brokers to have plug-able consensus and meta data storage sub systems

2015-12-01 Thread Neha Narkhede
I share Jay's concerns around plugins and he explained that very well, so I
will avoid repeating those.

Agree that there is some feedback about getting rid of ZooKeeper altogether
and I'm on board with building a native version whenever we are ready to
take this on. But I don't think the pain is big enough to solve this by
providing a short-term solution via plugins.

With respect to ZooKeeper, a more significant issue is performance and
correct use of the ZooKeeper APIs. First, switching to the bulk read and
write APIs, which ZooKeeper released a while ago, will make a lot of
things around failover better and faster. Second, there is value in not
wrapping the core ZooKeeper APIs in third-party plugins (whether it is
ZkClient or Curator) since they try to mask functionality in ways that end
up making it very tricky to write correct code. Case in point:
https://issues.apache.org/jira/browse/KAFKA-1387. I suspect there are
places in our code that still don't use the ZK watcher functionality
correctly with the impact being losing important notifications and not
acting on certain state changes at all. This is because we depend on
ZkClient and it ends up hiding some details of the ZooKeeper API that
shouldn't be hidden to handle such cases correctly.
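
For reference, the bulk write facility alluded to here is presumably ZooKeeper's
multi() transaction API (added in ZooKeeper 3.4). A minimal sketch, with a
made-up connect string, znode paths, and version numbers, of batching several
updates into one atomic round trip:

{code}
import java.util.Arrays;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.OpResult;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MultiOpExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // All operations succeed or fail together, so e.g. a controller
        // failover could update many partition-state znodes in one round trip
        // instead of issuing one write per znode.
        List<Op> ops = Arrays.asList(
                Op.check("/controller_epoch", 7),
                Op.setData("/brokers/topics/test/partitions/0/state",
                        "{\"leader\":1}".getBytes(), -1),
                Op.create("/admin/some_marker", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));

        List<OpResult> results = zk.multi(ops);
        System.out.println("applied " + results.size() + " ops atomically");
        zk.close();
    }
}
{code}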

My concern is that this is a problem we will have with every pluggable
implementation. Today, a lot of our code depends on the sort of guarantees
that ZooKeeper provides around watches and ordering. I don't know the other
systems well enough to say whether they would be able to provide similar
guarantees around all the operations we'd want to support.

A better use of effort is to focus on fixing our use of ZooKeeper until we
can come back and replace it with the native implementation.

On Tue, Dec 1, 2015 at 12:25 PM, Joe Stein  wrote:

> Yeah, lets do both! :) I always had trepidations about leaving things as is
> with ZooKeeper there. Can we have this new internal system be what replaces
> that but still make it modular somewhat.
>
> The problem with any new system is that everyone already trusts and relies
> on the existing scars we know heal. That is why we all are still using
> ZooKeeper ( I bet at least 3 clusters are still on 3.3.4 and one maybe
> 3.3.1 or something nutty ).
>
> etcd
> consul
> c*
> riak
> akka
>
> All have viable solutions and i have no idea what will be best or worst or
> even work but lots of folks are working on it now trying to get things to
> be different and work right for them.
>
> I think a native version should be there in the project and I am 100% on
> board with that native version NOT be ZooKeeper but homegrown.
>
> I also think the native default should use the KIP-30 interface so other
> server can also connect the feature they are solving also (that way
> deployments that have already adopted XYZ for consensus can use it).
>
> ~ Joe Stein
> - - - - - - - - - - - - - - - - - - -
>  [image: Logo-Black.jpg]
>   http://www.elodina.net
> http://www.stealth.ly
> - - - - - - - - - - - - - - - - - - -
>
> On Tue, Dec 1, 2015 at 2:58 PM, Jay Kreps  wrote:
>
> > Hey Joe,
> >
> > Thanks for raising this. People really want to get rid of the ZK
> > dependency, I agree it is among the most asked for things. Let me give a
> > quick critique and a more radical plan.
> >
> > I don't think making ZK pluggable is the right thing to do. I have a lot
> of
> > experience with this dynamic of introducing plugins for core
> functionality
> > because I previously worked on a key-value store called Voldemort in
> which
> > we made both the protocol and storage engine totally pluggable. I
> > originally felt this was a good thing both philosophically and
> practically,
> > but in retrospect came to believe it was a huge mistake--what people
> really
> > wanted was one really excellent implementation with the kind of insane
> > levels of in-production usage and test coverage that infrastructure
> > demands. Pluggability is actually really at odds with this, and the
> ability
> > to actually abstract over some really meaty dependency like a storage
> > engine never quite works.
> >
> > People dislike the ZK dependency because it effectively doubles the
> > operational load of Kafka--it doubles the amount of configuration,
> > monitoring, and understanding needed. Replacing ZK with a similar system
> > won't fix this problem though--all the other consensus services are
> equally
> > complex (and often less mature)--and it will cause two new problems.
> First
> > there will be a layer of indirection that will make reasoning and
> improving
> > the ZK implementation harder. For example, note that your plug-in api
> > doesn't seem to cover multi-get and multi-write, when we added that we
> > would end up breaking all plugins. Each new thing will be like that. Ops
> > tools, config, documentation, etc will no longer be able to include any
> > coverage of ZK because we can't assume ZK so all that becomes much
> harder.
> > The second problem

Re: 0.9.0.0 release notes is opening download mirrors page

2015-12-01 Thread Kris K
Thanks Jun. Missed that part, my bad.

Regards,
Kris

On Mon, Nov 30, 2015 at 4:17 PM, Jun Rao  wrote:

> Kris,
>
> It just points to the mirror site. If you click on one of the links, you
> will see the release notes.
>
> Thanks,
>
> Jun
>
> On Mon, Nov 30, 2015 at 1:37 PM, Kris K  wrote:
>
> > Hi,
> >
> > Just noticed that the Release notes link of 0.9.0.0 is pointing to the
> > download mirrors page.
> >
> >
> >
> https://www.apache.org/dyn/closer.cgi?path=/kafka/0.9.0.0/RELEASE_NOTES.html
> >
> >
> > Thanks,
> > Kris K
> >
>


[jira] [Assigned] (KAFKA-2896) System test for partition re-assignment

2015-12-01 Thread Anna Povzner (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anna Povzner reassigned KAFKA-2896:
---

Assignee: Anna Povzner

> System test for partition re-assignment
> ---
>
> Key: KAFKA-2896
> URL: https://issues.apache.org/jira/browse/KAFKA-2896
> Project: Kafka
>  Issue Type: Task
>Reporter: Gwen Shapira
>Assignee: Anna Povzner
>
> Lots of users depend on partition re-assignment tool to manage their cluster. 
> Will be nice to have a simple system tests that creates a topic with few 
> partitions and few replicas, reassigns everything and validates the ISR 
> afterwards. 
> Just to make sure we are not breaking anything. Especially since we have 
> plans to improve (read: modify) this area.





[jira] [Work started] (KAFKA-2896) System test for partition re-assignment

2015-12-01 Thread Anna Povzner (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on KAFKA-2896 started by Anna Povzner.
---
> System test for partition re-assignment
> ---
>
> Key: KAFKA-2896
> URL: https://issues.apache.org/jira/browse/KAFKA-2896
> Project: Kafka
>  Issue Type: Task
>Reporter: Gwen Shapira
>Assignee: Anna Povzner
>
> Lots of users depend on partition re-assignment tool to manage their cluster. 
> Will be nice to have a simple system tests that creates a topic with few 
> partitions and few replicas, reassigns everything and validates the ISR 
> afterwards. 
> Just to make sure we are not breaking anything. Especially since we have 
> plans to improve (read: modify) this area.





[jira] [Commented] (KAFKA-2896) System test for partition re-assignment

2015-12-01 Thread Anna Povzner (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034606#comment-15034606
 ] 

Anna Povzner commented on KAFKA-2896:
-

[~jinxing6...@126.com] I am already working on it, sorry I did not assign it to 
myself earlier.

> System test for partition re-assignment
> ---
>
> Key: KAFKA-2896
> URL: https://issues.apache.org/jira/browse/KAFKA-2896
> Project: Kafka
>  Issue Type: Task
>Reporter: Gwen Shapira
>Assignee: Anna Povzner
>
> Lots of users depend on partition re-assignment tool to manage their cluster. 
> Will be nice to have a simple system tests that creates a topic with few 
> partitions and few replicas, reassigns everything and validates the ISR 
> afterwards. 
> Just to make sure we are not breaking anything. Especially since we have 
> plans to improve (read: modify) this area.





Build failed in Jenkins: kafka_0.9.0_jdk7 #53

2015-12-01 Thread Apache Jenkins Server
See 

Changes:

[wangguoz] MINOR - fix typo in index corruption warning message

--
[...truncated 214 lines...]
  new UpdateMetadataRequest.BrokerEndPoint(brokerEndPoint.id, 
brokerEndPoint.host, brokerEndPoint.port)
^
:391:
 constructor UpdateMetadataRequest in class UpdateMetadataRequest is 
deprecated: see corresponding Javadoc for more information.
new UpdateMetadataRequest(controllerId, controllerEpoch, 
liveBrokers.asJava, partitionStates.asJava)
^
:129:
 method readFromReadableChannel in class NetworkReceive is deprecated: see 
corresponding Javadoc for more information.
  response.readFromReadableChannel(channel)
   ^
there were 15 feature warning(s); re-run with -feature for details
14 warnings found
:core:processResources UP-TO-DATE
:core:classes
:clients:compileTestJava UP-TO-DATE
:clients:processTestResources UP-TO-DATE
:clients:testClasses UP-TO-DATE
:core:copyDependantLibs UP-TO-DATE
:core:copyDependantTestLibs
:core:jar UP-TO-DATE
:examples:compileJava
:examples:processResources UP-TO-DATE
:examples:classes
:examples:jar
:log4j-appender:compileJava
:log4j-appender:processResources UP-TO-DATE
:log4j-appender:classes
:log4j-appender:jar
:tools:compileJava
:tools:processResources UP-TO-DATE
:tools:classes
:clients:javadoc
:clients:javadocJar
:clients:srcJar
:clients:testJar
:clients:signArchives SKIPPED
:tools:copyDependantLibs
:tools:jar
:connect:api:compileJavaNote: 

 uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

:connect:api:processResources UP-TO-DATE
:connect:api:classes
:connect:api:copyDependantLibs
:connect:api:jar
:connect:file:compileJava
:connect:file:processResources UP-TO-DATE
:connect:file:classes
:connect:file:copyDependantLibs
:connect:file:jar
:connect:json:compileJavaNote: 

 uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

:connect:json:processResources UP-TO-DATE
:connect:json:classes
:connect:json:copyDependantLibs
:connect:json:jar
:connect:runtime:compileJavaNote: Some input files use unchecked or unsafe 
operations.
Note: Recompile with -Xlint:unchecked for details.

:connect:runtime:processResources UP-TO-DATE
:connect:runtime:classes
:connect:runtime:copyDependantLibs
:connect:runtime:jar
:jarAll
:test_core_2_10_5
Building project 'core' with Scala version 2.10.5
:kafka_0.9.0_jdk7:clients:compileJava UP-TO-DATE
:kafka_0.9.0_jdk7:clients:processResources UP-TO-DATE
:kafka_0.9.0_jdk7:clients:classes UP-TO-DATE
:kafka_0.9.0_jdk7:clients:determineCommitId UP-TO-DATE
:kafka_0.9.0_jdk7:clients:createVersionFile
:kafka_0.9.0_jdk7:clients:jar UP-TO-DATE
:kafka_0.9.0_jdk7:clients:compileTestJava UP-TO-DATE
:kafka_0.9.0_jdk7:clients:processTestResources UP-TO-DATE
:kafka_0.9.0_jdk7:clients:testClasses UP-TO-DATE
:kafka_0.9.0_jdk7:core:compileJava UP-TO-DATE
:kafka_0.9.0_jdk7:core:compileScala UP-TO-DATE
:kafka_0.9.0_jdk7:core:processResources UP-TO-DATE
:kafka_0.9.0_jdk7:core:classes UP-TO-DATE
:kafka_0.9.0_jdk7:core:compileTestJava UP-TO-DATE
:kafka_0.9.0_jdk7:core:compileTestScala
:test_core_2_10_5 FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Failed to capture snapshot of input files for task 'compileTestScala' during 
up-to-date check.  See stacktrace for details.
> Could not add entry 
> '
>  to cache fileHashes.bin 
> (

* Try:
Run with --info or --debug option to get more log output.

* Exception is:
org.gradle.api.UncheckedIOException: Failed to capture snapshot of input files 
for task 'compileTestScala' during up-to-date check.  See stacktrace for 
details.
at 
org.gradle.api.internal.changedetection.rules.TaskUpToDateState.(TaskUpToDateState.java:59)
at 
org.gradle.api.internal.changedetection.changes.DefaultTaskArtifactStateRepository$TaskArtifactStateImpl.getStates(DefaultTaskArtifactStateRepository.java:126)
at 
org.gradle.api.internal.changedetection.changes.DefaultTaskArtifactStateRepository$TaskArtifactStateImpl.isUpToDate(DefaultTaskArtifactStateRepository.java:69)
at 
org.gradle.api.internal.tasks.execution.SkipUpToDateTaskExecuter.execute(SkipUpToDateTaskExecute

Jenkins build is back to normal : kafka_0.9.0_jdk7 #52

2015-12-01 Thread Apache Jenkins Server
See 



Re: [DISCUSS] KIP-30 Allow for brokers to have plug-able consensus and meta data storage sub systems

2015-12-01 Thread Joe Stein
Yeah, let's do both! :) I always had trepidations about leaving things as is
with ZooKeeper there. Can we have this new internal system be what replaces
that but still make it somewhat modular?

The problem with any new system is that everyone already trusts and relies
on the existing scars we know heal. That is why we all are still using
ZooKeeper (I bet at least 3 clusters are still on 3.3.4 and maybe one on
3.3.1 or something nutty).

etcd
consul
c*
riak
akka

All have viable solutions, and I have no idea which will be best or worst or
even work, but lots of folks are working on it now, trying to get things to
be different and work right for them.

I think a native version should be there in the project, and I am 100% on
board with that native version NOT being ZooKeeper but homegrown.

I also think the native default should use the KIP-30 interface so other
servers can also plug in for the feature they are solving (that way
deployments that have already adopted XYZ for consensus can use it).

~ Joe Stein
- - - - - - - - - - - - - - - - - - -
 [image: Logo-Black.jpg]
  http://www.elodina.net
http://www.stealth.ly
- - - - - - - - - - - - - - - - - - -

On Tue, Dec 1, 2015 at 2:58 PM, Jay Kreps  wrote:

> Hey Joe,
>
> Thanks for raising this. People really want to get rid of the ZK
> dependency, I agree it is among the most asked for things. Let me give a
> quick critique and a more radical plan.
>
> I don't think making ZK pluggable is the right thing to do. I have a lot of
> experience with this dynamic of introducing plugins for core functionality
> because I previously worked on a key-value store called Voldemort in which
> we made both the protocol and storage engine totally pluggable. I
> originally felt this was a good thing both philosophically and practically,
> but in retrospect came to believe it was a huge mistake--what people really
> wanted was one really excellent implementation with the kind of insane
> levels of in-production usage and test coverage that infrastructure
> demands. Pluggability is actually really at odds with this, and the ability
> to actually abstract over some really meaty dependency like a storage
> engine never quite works.
>
> People dislike the ZK dependency because it effectively doubles the
> operational load of Kafka--it doubles the amount of configuration,
> monitoring, and understanding needed. Replacing ZK with a similar system
> won't fix this problem though--all the other consensus services are equally
> complex (and often less mature)--and it will cause two new problems. First
> there will be a layer of indirection that will make reasoning and improving
> the ZK implementation harder. For example, note that your plug-in api
> doesn't seem to cover multi-get and multi-write, when we added that we
> would end up breaking all plugins. Each new thing will be like that. Ops
> tools, config, documentation, etc will no longer be able to include any
> coverage of ZK because we can't assume ZK so all that becomes much harder.
> The second problem is that this introduces a combinatorial testing problem.
> People say they want to swap out ZK but they are assuming whatever they
> swap in will work equally well. How will we know that is true? The only way
> to explode out the testing to run with every possible plugin.
>
> If you want to see this in action take a look at ActiveMQ. ActiveMQ is less
> a system than a family of co-operating plugins and a configuration language
> for assembling them. Software engineers and open source communities are
> really prone to this kind of thing because "we can just make it pluggable"
> ends any argument. But the actual implementation is a mess, and later
> improvements in their threading, I/O, and other core models simply couldn't
> be made across all the plugins.
>
> This blog post on configurability in UI is a really good summary of a
> similar dynamic:
> http://ometer.com/free-software-ui.html
>
> Anyhow, not to go too far off on a rant. Clearly I have plugin PTSD :-)
>
> I think instead we should explore the idea of getting rid of the zookeeper
> dependency and replace it with an internal facility. Let me explain what I
> mean. In terms of API what Kafka and ZK do is super different, but
> internally it is actually quite similar--they are both trying to maintain a
> CP log.
>
> What would actually make the system significantly simpler would be to
> reimplement the facilities you describe on top of Kafka's existing
> infrastructure--using the same log implementation, network stack, config,
> monitoring, etc. If done correctly this would dramatically lower the
> operational load of the system versus the current Kafka+ZK or proposed
> Kafka+X.
>
> I don't have a proposal for how this would work and it's some effort to
> scope it out. The obvious thing to do would just be to keep the existing
> ISR/Controller setup and rebuild the controller etc on a RAFT/Paxos impl
> using the Kafka network/log/etc and have a replicated config databas

[jira] [Commented] (KAFKA-2308) New producer + Snappy face un-compression errors after broker restart

2015-12-01 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034497#comment-15034497
 ] 

Guozhang Wang commented on KAFKA-2308:
--

[~allenxwang] This bug should only be in the producer, due to its usage pattern 
of snappy.
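
For anyone trying to reproduce this outside the performance producer, a minimal
sketch of a new-producer client configured the same way (the bootstrap address
is made up; the topic name t3 is taken from the broker log quoted below):

{code}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SnappyProducerRepro {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("compression.type", "snappy"); // codec 2 in the repro steps
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keep sending while one broker is bounced, as in the reproduction
            // steps below; the surviving brokers were the ones logging the
            // decompression errors.
            for (int i = 0; i < 10_000_000; i++) {
                producer.send(new ProducerRecord<>("t3", Integer.toString(i)));
            }
        }
    }
}
{code}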

> New producer + Snappy face un-compression errors after broker restart
> -
>
> Key: KAFKA-2308
> URL: https://issues.apache.org/jira/browse/KAFKA-2308
> Project: Kafka
>  Issue Type: Bug
>Reporter: Gwen Shapira
>Assignee: Gwen Shapira
> Fix For: 0.9.0.0, 0.8.2.2
>
> Attachments: KAFKA-2308.patch
>
>
> Looks like the new producer, when used with Snappy, following a broker 
> restart is sending messages the brokers can't decompress. This issue was 
> discussed at few mailing lists thread, but I don't think we ever resolved it.
> I can reproduce with trunk and Snappy 1.1.1.7. 
> To reproduce:
> 1. Start 3 brokers
> 2. Create a topic with 3 partitions and 3 replicas each.
> 2. Start performance producer with --new-producer --compression-codec 2 (and 
> set the number of messages to fairly high, to give you time. I went with 10M)
> 3. Bounce one of the brokers
> 4. The log of one of the surviving nodes should contain errors like:
> {code}
> 2015-07-02 13:45:59,300 ERROR kafka.server.ReplicaManager: [Replica Manager 
> on Broker 66]: Error processing append operation on partition [t3,0]
> kafka.common.KafkaException:
> at 
> kafka.message.ByteBufferMessageSet$$anon$1.makeNext(ByteBufferMessageSet.scala:94)
> at 
> kafka.message.ByteBufferMessageSet$$anon$1.makeNext(ByteBufferMessageSet.scala:64)
> at 
> kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
> at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
> at 
> kafka.message.ByteBufferMessageSet$$anon$2.innerDone(ByteBufferMessageSet.scala:177)
> at 
> kafka.message.ByteBufferMessageSet$$anon$2.makeNext(ByteBufferMessageSet.scala:218)
> at 
> kafka.message.ByteBufferMessageSet$$anon$2.makeNext(ByteBufferMessageSet.scala:173)
> at 
> kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
> at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> kafka.message.ByteBufferMessageSet.validateMessagesAndAssignOffsets(ByteBufferMessageSet.scala:267)
> at kafka.log.Log.liftedTree1$1(Log.scala:327)
> at kafka.log.Log.append(Log.scala:326)
> at 
> kafka.cluster.Partition$$anonfun$appendMessagesToLeader$1.apply(Partition.scala:423)
> at 
> kafka.cluster.Partition$$anonfun$appendMessagesToLeader$1.apply(Partition.scala:409)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
> at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:268)
> at kafka.cluster.Partition.appendMessagesToLeader(Partition.scala:409)
> at 
> kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:365)
> at 
> kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:350)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:350)

[jira] [Updated] (KAFKA-2421) Upgrade LZ4 to version 1.3 to avoid crashing with IBM Java 7

2015-12-01 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-2421:
-
   Resolution: Fixed
Fix Version/s: 0.9.1.0
   Status: Resolved  (was: Patch Available)

> Upgrade LZ4 to version 1.3 to avoid crashing with IBM Java 7
> 
>
> Key: KAFKA-2421
> URL: https://issues.apache.org/jira/browse/KAFKA-2421
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1
> Environment: IBM Java 7
>Reporter: Rajini Sivaram
>Assignee: Grant Henke
> Fix For: 0.9.1.0
>
> Attachments: KAFKA-2421.patch, KAFKA-2421_2015-08-11_18:54:26.patch, 
> kafka-2421_2015-09-08_11:38:03.patch
>
>
> Upgrade LZ4 to version 1.3 to avoid crashing with IBM Java 7.
> LZ4 version 1.2 crashes with 64-bit IBM Java 7. This has been fixed in LZ4 
> version 1.3 (https://github.com/jpountz/lz4-java/blob/master/CHANGES.md, 
> https://github.com/jpountz/lz4-java/pull/46).
> The unit test org.apache.kafka.common.record.MemoryRecordsTest crashes when 
> run with 64-bit IBM Java7 with the error:
> {quote}
> 023EB900: Native Method 0263CE10 
> (net/jpountz/lz4/LZ4JNI.LZ4_compress_limitedOutput([BII[BII)I)
> 023EB900: Invalid JNI call of function void 
> ReleasePrimitiveArrayCritical(JNIEnv *env, jarray array, void *carray, jint 
> mode): For array FFF7EAB8 parameter carray passed FFF85998, 
> expected to be FFF7EAC0
> 14:08:42.763 0x23eb900j9mm.632*   ** ASSERTION FAILED ** at 
> StandardAccessBarrier.cpp:335: ((false))
> JVMDUMP039I Processing dump event "traceassert", detail "" at 2015/08/11 
> 15:08:42 - please wait.
> {quote}
> Stack trace from javacore:
> 3XMTHREADINFO3   Java callstack:
> 4XESTACKTRACEat 
> net/jpountz/lz4/LZ4JNI.LZ4_compress_limitedOutput(Native Method)
> 4XESTACKTRACEat 
> net/jpountz/lz4/LZ4JNICompressor.compress(LZ4JNICompressor.java:31)
> 4XESTACKTRACEat 
> net/jpountz/lz4/LZ4Factory.(LZ4Factory.java:163)
> 4XESTACKTRACEat 
> net/jpountz/lz4/LZ4Factory.instance(LZ4Factory.java:46)
> 4XESTACKTRACEat 
> net/jpountz/lz4/LZ4Factory.nativeInstance(LZ4Factory.java:76)
> 5XESTACKTRACE   (entered lock: 
> net/jpountz/lz4/LZ4Factory@0xE02F0BE8, entry count: 1)
> 4XESTACKTRACEat 
> net/jpountz/lz4/LZ4Factory.fastestInstance(LZ4Factory.java:129)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/KafkaLZ4BlockOutputStream.(KafkaLZ4BlockOutputStream.java:72)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/KafkaLZ4BlockOutputStream.(KafkaLZ4BlockOutputStream.java:93)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/KafkaLZ4BlockOutputStream.(KafkaLZ4BlockOutputStream.java:103)
> 4XESTACKTRACEat 
> sun/reflect/NativeConstructorAccessorImpl.newInstance0(Native Method)
> 4XESTACKTRACEat 
> sun/reflect/NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:86)
> 4XESTACKTRACEat 
> sun/reflect/DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:58)
> 4XESTACKTRACEat 
> java/lang/reflect/Constructor.newInstance(Constructor.java:542)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/Compressor.wrapForOutput(Compressor.java:222)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/Compressor.(Compressor.java:72)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/Compressor.(Compressor.java:76)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/MemoryRecords.(MemoryRecords.java:43)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/MemoryRecords.emptyRecords(MemoryRecords.java:51)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/MemoryRecords.emptyRecords(MemoryRecords.java:55)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/MemoryRecordsTest.testIterator(MemoryRecordsTest.java:42)
> java -version
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build pxa6470_27sr3fp1-20150605_01(SR3 FP1))
> IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 
> 20150407_243189 (JIT enabled, AOT enabled)
> J9VM - R27_Java727_SR3_20150407_1831_B243189
> JIT  - tr.r13.java_20150406_89182
> GC   - R27_Java727_SR3_20150407_1831_B243189_CMPRSS
> J9CL - 20150407_243189)
> JCL - 20150601_01 based on Oracle 7u79-b14
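
For context, the crash is hit as soon as lz4-java selects its JNI-backed
implementation. A minimal round-trip sketch of the net.jpountz.lz4 call path
involved (which implementation fastestInstance() actually picks depends on the
platform and lz4-java version):

{code}
import java.nio.charset.StandardCharsets;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

public class Lz4RoundTrip {
    public static void main(String[] args) {
        // fastestInstance() prefers the JNI-backed implementation when the
        // native library loads; with lz4-java 1.2 this is where 64-bit IBM
        // Java 7 crashed, which the 1.3 upgrade avoids.
        LZ4Factory factory = LZ4Factory.fastestInstance();

        byte[] src = "some payload to compress".getBytes(StandardCharsets.UTF_8);
        LZ4Compressor compressor = factory.fastCompressor();
        byte[] compressed = compressor.compress(src);

        LZ4FastDecompressor decompressor = factory.fastDecompressor();
        byte[] restored = decompressor.decompress(compressed, src.length);
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
{code}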





[jira] [Commented] (KAFKA-2421) Upgrade LZ4 to version 1.3 to avoid crashing with IBM Java 7

2015-12-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034476#comment-15034476
 ] 

ASF GitHub Bot commented on KAFKA-2421:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/552


> Upgrade LZ4 to version 1.3 to avoid crashing with IBM Java 7
> 
>
> Key: KAFKA-2421
> URL: https://issues.apache.org/jira/browse/KAFKA-2421
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1
> Environment: IBM Java 7
>Reporter: Rajini Sivaram
>Assignee: Grant Henke
> Attachments: KAFKA-2421.patch, KAFKA-2421_2015-08-11_18:54:26.patch, 
> kafka-2421_2015-09-08_11:38:03.patch
>
>
> Upgrade LZ4 to version 1.3 to avoid crashing with IBM Java 7.
> LZ4 version 1.2 crashes with 64-bit IBM Java 7. This has been fixed in LZ4 
> version 1.3 (https://github.com/jpountz/lz4-java/blob/master/CHANGES.md, 
> https://github.com/jpountz/lz4-java/pull/46).
> The unit test org.apache.kafka.common.record.MemoryRecordsTest crashes when 
> run with 64-bit IBM Java7 with the error:
> {quote}
> 023EB900: Native Method 0263CE10 
> (net/jpountz/lz4/LZ4JNI.LZ4_compress_limitedOutput([BII[BII)I)
> 023EB900: Invalid JNI call of function void 
> ReleasePrimitiveArrayCritical(JNIEnv *env, jarray array, void *carray, jint 
> mode): For array FFF7EAB8 parameter carray passed FFF85998, 
> expected to be FFF7EAC0
> 14:08:42.763 0x23eb900j9mm.632*   ** ASSERTION FAILED ** at 
> StandardAccessBarrier.cpp:335: ((false))
> JVMDUMP039I Processing dump event "traceassert", detail "" at 2015/08/11 
> 15:08:42 - please wait.
> {quote}
> Stack trace from javacore:
> 3XMTHREADINFO3   Java callstack:
> 4XESTACKTRACEat 
> net/jpountz/lz4/LZ4JNI.LZ4_compress_limitedOutput(Native Method)
> 4XESTACKTRACEat 
> net/jpountz/lz4/LZ4JNICompressor.compress(LZ4JNICompressor.java:31)
> 4XESTACKTRACEat 
> net/jpountz/lz4/LZ4Factory.(LZ4Factory.java:163)
> 4XESTACKTRACEat 
> net/jpountz/lz4/LZ4Factory.instance(LZ4Factory.java:46)
> 4XESTACKTRACEat 
> net/jpountz/lz4/LZ4Factory.nativeInstance(LZ4Factory.java:76)
> 5XESTACKTRACE   (entered lock: 
> net/jpountz/lz4/LZ4Factory@0xE02F0BE8, entry count: 1)
> 4XESTACKTRACEat 
> net/jpountz/lz4/LZ4Factory.fastestInstance(LZ4Factory.java:129)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/KafkaLZ4BlockOutputStream.(KafkaLZ4BlockOutputStream.java:72)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/KafkaLZ4BlockOutputStream.(KafkaLZ4BlockOutputStream.java:93)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/KafkaLZ4BlockOutputStream.(KafkaLZ4BlockOutputStream.java:103)
> 4XESTACKTRACEat 
> sun/reflect/NativeConstructorAccessorImpl.newInstance0(Native Method)
> 4XESTACKTRACEat 
> sun/reflect/NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:86)
> 4XESTACKTRACEat 
> sun/reflect/DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:58)
> 4XESTACKTRACEat 
> java/lang/reflect/Constructor.newInstance(Constructor.java:542)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/Compressor.wrapForOutput(Compressor.java:222)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/Compressor.(Compressor.java:72)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/Compressor.(Compressor.java:76)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/MemoryRecords.(MemoryRecords.java:43)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/MemoryRecords.emptyRecords(MemoryRecords.java:51)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/MemoryRecords.emptyRecords(MemoryRecords.java:55)
> 4XESTACKTRACEat 
> org/apache/kafka/common/record/MemoryRecordsTest.testIterator(MemoryRecordsTest.java:42)
> java -version
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build pxa6470_27sr3fp1-20150605_01(SR3 FP1))
> IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 
> 20150407_243189 (JIT enabled, AOT enabled)
> J9VM - R27_Java727_SR3_20150407_1831_B243189
> JIT  - tr.r13.java_20150406_89182
> GC   - R27_Java727_SR3_20150407_1831_B243189_CMPRSS
> J9CL - 20150407_243189)
> JCL - 20150601_01 based on Oracle 7u79-b14





Re: [DISCUSS] KIP-30 Allow for brokers to have plug-able consensus and meta data storage sub systems

2015-12-01 Thread Jay Kreps
Hey Joe,

Thanks for raising this. People really want to get rid of the ZK
dependency, I agree it is among the most asked for things. Let me give a
quick critique and a more radical plan.

I don't think making ZK pluggable is the right thing to do. I have a lot of
experience with this dynamic of introducing plugins for core functionality
because I previously worked on a key-value store called Voldemort in which
we made both the protocol and storage engine totally pluggable. I
originally felt this was a good thing both philosophically and practically,
but in retrospect came to believe it was a huge mistake--what people really
wanted was one really excellent implementation with the kind of insane
levels of in-production usage and test coverage that infrastructure
demands. Pluggability is actually really at odds with this, and the ability
to actually abstract over some really meaty dependency like a storage
engine never quite works.

People dislike the ZK dependency because it effectively doubles the
operational load of Kafka--it doubles the amount of configuration,
monitoring, and understanding needed. Replacing ZK with a similar system
won't fix this problem though--all the other consensus services are equally
complex (and often less mature)--and it will cause two new problems. First
there will be a layer of indirection that will make reasoning and improving
the ZK implementation harder. For example, note that your plug-in api
doesn't seem to cover multi-get and multi-write; when we added that, we
would end up breaking all plugins. Each new thing will be like that. Ops
tools, config, documentation, etc. will no longer be able to include any
coverage of ZK because we can't assume ZK, so all that becomes much harder.
The second problem is that this introduces a combinatorial testing problem.
People say they want to swap out ZK but they are assuming whatever they
swap in will work equally well. How will we know that is true? The only way
is to explode out the testing to run with every possible plugin.

If you want to see this in action take a look at ActiveMQ. ActiveMQ is less
a system than a family of co-operating plugins and a configuration language
for assembling them. Software engineers and open source communities are
really prone to this kind of thing because "we can just make it pluggable"
ends any argument. But the actual implementation is a mess, and later
improvements in their threading, I/O, and other core models simply couldn't
be made across all the plugins.

This blog post on configurability in UI is a really good summary of a
similar dynamic:
http://ometer.com/free-software-ui.html

Anyhow, not to go too far off on a rant. Clearly I have plugin PTSD :-)

I think instead we should explore the idea of getting rid of the zookeeper
dependency and replacing it with an internal facility. Let me explain what I
mean. In terms of API what Kafka and ZK do is super different, but
internally it is actually quite similar--they are both trying to maintain a
CP log.

What would actually make the system significantly simpler would be to
reimplement the facilities you describe on top of Kafka's existing
infrastructure--using the same log implementation, network stack, config,
monitoring, etc. If done correctly this would dramatically lower the
operational load of the system versus the current Kafka+ZK or proposed
Kafka+X.

I don't have a proposal for how this would work and it's some effort to
scope it out. The obvious thing to do would just be to keep the existing
ISR/Controller setup and rebuild the controller etc on a RAFT/Paxos impl
using the Kafka network/log/etc and have a replicated config database
(maybe rocksdb) that was fed off the log and shared by all nodes.
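
As a very rough sketch of the "fed off the log" idea (nothing here is a design
from this thread; the topic name and key/value types are made up), a node could
materialize its view of cluster metadata by replaying a compacted Kafka topic
with the ordinary consumer and keeping the latest value per key:

{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MetadataMaterializer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // made-up address
        props.put("group.id", "metadata-materializer-demo");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // In-memory view of the latest value per key; a real implementation
        // might spill to a local store such as RocksDB.
        Map<String, String> state = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Hypothetical compacted topic holding cluster metadata records.
            consumer.subscribe(Collections.singletonList("__cluster_metadata_demo"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) {
                        state.remove(record.key());   // tombstone: key deleted
                    } else {
                        state.put(record.key(), record.value());
                    }
                }
            }
        }
    }
}
{code}

This is roughly the pattern Connect's KafkaBasedLog (exercised by the
KafkaBasedLogTest runs earlier in this digest) already uses for its config and
offset topics.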

If done well this could have the advantage of potentially allowing us to
scale the number of partitions quite significantly (the k/v store would not
need to be all in memory), though you would likely still have limits on the
number of partitions per machine. This would make the minimum Kafka cluster
size be just your replication factor.

People tend to feel that implementing things like RAFT or Paxos is too hard
for mere mortals. But I actually think it is within our capabilities, and
our testing capabilities as well as experience with this type of thing have
improved to the point where we should not be scared off if it is the right
path.

This approach is likely more work than plugins (though maybe not, once you
factor in all the docs, testing, etc.) but if done correctly it would be an
unambiguous step forward--a simpler, more scalable implementation with no
operational dependencies.

Thoughts?

-Jay





On Tue, Dec 1, 2015 at 11:12 AM, Joe Stein  wrote:

> I would like to start a discussion around the work that has started in
> regards to KIP-30
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems
>
> The impetus for working on this came a lot from the community.

[GitHub] kafka pull request: KAFKA-2421: Upgrade LZ4 to version 1.3

2015-12-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/552




[GitHub] kafka pull request: MINOR: Fix building from subproject directory

2015-12-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/509




[GitHub] kafka pull request: MINOR - fix typo in index corruption warning m...

2015-12-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/606




[jira] [Updated] (KAFKA-2837) FAILING TEST: kafka.api.ProducerBounceTest > testBrokerFailure

2015-12-01 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-2837:
-
Labels: newbie  (was: )

> FAILING TEST: kafka.api.ProducerBounceTest > testBrokerFailure 
> ---
>
> Key: KAFKA-2837
> URL: https://issues.apache.org/jira/browse/KAFKA-2837
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Gwen Shapira
>  Labels: newbie
>
> {code}
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> kafka.api.ProducerBounceTest.testBrokerFailure(ProducerBounceTest.scala:117)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.runTestClass(JUnitTestClassExecuter.java:105)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.execute(JUnitTestClassExecuter.java:56)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassProcessor.processTestClass(JUnitTestClassProcessor.java:64)
>   at 
> org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:50)
>   at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
>   at 
> org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
>   at 
> org.gradle.messaging.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
>   at 
> org.gradle.messaging.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
>   at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
>   at 
> org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:106)
>   at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
>   at 
> org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
>   at 
> org.gradle.messaging.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:360)
>   at 
> org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:54)
>   at 
> org.gradle.internal.concurrent.StoppableExecutorImpl$1.run(StoppableExecutorImpl.java:40)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> https://builds.apache.org/job/kafka-trunk-jdk7/815/console
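For anyone trying to reproduce this locally, the failing test can be run on its own. A minimal sketch, assuming a trunk checkout and a Gradle version that supports --tests filtering (the exact invocation may differ by branch):

{code}
# Run only the flaky test; repeat a few times since the failure is intermittent.
./gradlew core:test --tests kafka.api.ProducerBounceTest --info
{code}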





[jira] [Resolved] (KAFKA-2915) System Tests that use bootstrap.servers embedded in jinja files are not working

2015-12-01 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang resolved KAFKA-2915.
--
   Resolution: Fixed
Fix Version/s: 0.9.1.0

Issue resolved by pull request 608
[https://github.com/apache/kafka/pull/608]

> System Tests that use bootstrap.servers embedded in jinja files are not 
> working
> ---
>
> Key: KAFKA-2915
> URL: https://issues.apache.org/jira/browse/KAFKA-2915
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
> Fix For: 0.9.1.0
>
>
> Regression due to changes in the way the tests handle security. 





[jira] [Commented] (KAFKA-2915) System Tests that use bootstrap.servers embedded in jinja files are not working

2015-12-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034462#comment-15034462
 ] 

ASF GitHub Bot commented on KAFKA-2915:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/608


> System Tests that use bootstrap.servers embedded in jinja files are not 
> working
> ---
>
> Key: KAFKA-2915
> URL: https://issues.apache.org/jira/browse/KAFKA-2915
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
> Fix For: 0.9.1.0
>
>
> Regression due to changes in the way the tests handle security. 





[GitHub] kafka pull request: KAFKA-2915: Fix problem with System Tests that...

2015-12-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/608




[jira] [Updated] (KAFKA-2837) FAILING TEST: kafka.api.ProducerBounceTest > testBrokerFailure

2015-12-01 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-2837:
-
Description: 
{code}
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
kafka.api.ProducerBounceTest.testBrokerFailure(ProducerBounceTest.scala:117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.runTestClass(JUnitTestClassExecuter.java:105)
at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.execute(JUnitTestClassExecuter.java:56)
at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassProcessor.processTestClass(JUnitTestClassProcessor.java:64)
at 
org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:50)
at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
at 
org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at 
org.gradle.messaging.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
at 
org.gradle.messaging.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
at 
org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:106)
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
at 
org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at 
org.gradle.messaging.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:360)
at 
org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:54)
at 
org.gradle.internal.concurrent.StoppableExecutorImpl$1.run(StoppableExecutorImpl.java:40)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{code}

https://builds.apache.org/job/kafka-trunk-jdk7/815/console

  was:
kafka.api.ProducerBounceTest > testBrokerFailure FAILED
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
kafka.api.ProducerBounceTest.testBrokerFailure(ProducerBounceTest.scala:117)

https://builds.apache.org/job/kafka-trunk-jdk7/815/console


> FAILING TEST: kafka.api.ProducerBounceTest > testBrokerFailure 
> ---
>
> 

Build failed in Jenkins: kafka-trunk-jdk7 #867

2015-12-01 Thread Apache Jenkins Server
See 

Changes:

[wangguoz] MINOR: Update LinkedIn JVM tuning settings

--
[...truncated 1793 lines...]
[...JDK 1.7.0_51 archive extraction listing omitted from the truncated console output; no Kafka-related output in this portion of the log...]

[jira] [Commented] (KAFKA-2837) FAILING TEST: kafka.api.ProducerBounceTest > testBrokerFailure

2015-12-01 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034446#comment-15034446
 ] 

Guozhang Wang commented on KAFKA-2837:
--

Another example: 
https://builds.apache.org/job/kafka-trunk-git-pr-jdk7/1569/testReport/kafka.api/ProducerBounceTest/testBrokerFailure/

> FAILING TEST: kafka.api.ProducerBounceTest > testBrokerFailure 
> ---
>
> Key: KAFKA-2837
> URL: https://issues.apache.org/jira/browse/KAFKA-2837
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Gwen Shapira
>
> kafka.api.ProducerBounceTest > testBrokerFailure FAILED
> java.lang.AssertionError
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertTrue(Assert.java:52)
> at 
> kafka.api.ProducerBounceTest.testBrokerFailure(ProducerBounceTest.scala:117)
> https://builds.apache.org/job/kafka-trunk-jdk7/815/console





[jira] [Created] (KAFKA-2923) Improve 0.9.0 Upgrade Documents

2015-12-01 Thread Guozhang Wang (JIRA)
Guozhang Wang created KAFKA-2923:


 Summary: Improve 0.9.0 Upgrade Documents 
 Key: KAFKA-2923
 URL: https://issues.apache.org/jira/browse/KAFKA-2923
 Project: Kafka
  Issue Type: Bug
Reporter: Guozhang Wang
 Fix For: 0.9.0.1


A few places where we can improve the upgrade docs:

1) Explanation about replica.lag.time.max.ms and how it relates to the old 
configs.

2) Default quota configs.

3) Client-server compatibility: old clients working with new servers and new 
clients working with old servers?
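To make the eventual doc text concrete, the settings in question look roughly like the sketch below. The values are illustrative placeholders, not recommendations from the docs: in 0.9.0 the time-based replica.lag.time.max.ms check replaces the old replica.lag.max.messages setting, and client quotas fall back to broker-wide defaults unless overridden per client id.

{code}
# server.properties (illustrative values only)
#   replica.lag.time.max.ms=10000      # time-based ISR check; replica.lag.max.messages is gone
#   quota.producer.default=10485760    # default producer quota in bytes/sec per client id
#   quota.consumer.default=20971520    # default consumer quota in bytes/sec per client id

# Per-client override of the default quotas:
bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --add-config 'producer_byte_rate=1048576,consumer_byte_rate=2097152' \
  --entity-type clients --entity-name clientA
{code}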





[jira] [Commented] (KAFKA-2308) New producer + Snappy face un-compression errors after broker restart

2015-12-01 Thread Allen Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034439#comment-15034439
 ] 

Allen Wang commented on KAFKA-2308:
---

[~gwenshap] [~guozhang] Is the fix in producer only? If I take 0.8.2.2 
producer, do I also need to have broker/consumer upgraded to 0.8.2.2 or a later 
snappy version in order to avoid this bug? Currently our broker is on 0.8.2.1 
and snappy 1.1.1.6. 

> New producer + Snappy face un-compression errors after broker restart
> -
>
> Key: KAFKA-2308
> URL: https://issues.apache.org/jira/browse/KAFKA-2308
> Project: Kafka
>  Issue Type: Bug
>Reporter: Gwen Shapira
>Assignee: Gwen Shapira
> Fix For: 0.9.0.0, 0.8.2.2
>
> Attachments: KAFKA-2308.patch
>
>
> It looks like the new producer, when used with Snappy, sends messages that the 
> brokers can't decompress after a broker restart. This issue was discussed in a 
> few mailing list threads, but I don't think we ever resolved it.
> I can reproduce with trunk and Snappy 1.1.1.7. 
> To reproduce:
> 1. Start 3 brokers
> 2. Create a topic with 3 partitions and 3 replicas each.
> 3. Start the performance producer with --new-producer --compression-codec 2 (and 
> set the number of messages fairly high, to give you time. I went with 10M)
> 4. Bounce one of the brokers
> 5. The log of one of the surviving nodes should contain errors like:
> {code}
> 2015-07-02 13:45:59,300 ERROR kafka.server.ReplicaManager: [Replica Manager 
> on Broker 66]: Error processing append operation on partition [t3,0]
> kafka.common.KafkaException:
> at 
> kafka.message.ByteBufferMessageSet$$anon$1.makeNext(ByteBufferMessageSet.scala:94)
> at 
> kafka.message.ByteBufferMessageSet$$anon$1.makeNext(ByteBufferMessageSet.scala:64)
> at 
> kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
> at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
> at 
> kafka.message.ByteBufferMessageSet$$anon$2.innerDone(ByteBufferMessageSet.scala:177)
> at 
> kafka.message.ByteBufferMessageSet$$anon$2.makeNext(ByteBufferMessageSet.scala:218)
> at 
> kafka.message.ByteBufferMessageSet$$anon$2.makeNext(ByteBufferMessageSet.scala:173)
> at 
> kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
> at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> kafka.message.ByteBufferMessageSet.validateMessagesAndAssignOffsets(ByteBufferMessageSet.scala:267)
> at kafka.log.Log.liftedTree1$1(Log.scala:327)
> at kafka.log.Log.append(Log.scala:326)
> at 
> kafka.cluster.Partition$$anonfun$appendMessagesToLeader$1.apply(Partition.scala:423)
> at 
> kafka.cluster.Partition$$anonfun$appendMessagesToLeader$1.apply(Partition.scala:409)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
> at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:268)
> at kafka.cluster.Partition.appendMessagesToLeader(Partition.scala:409)
> at 
> kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:365)
> at 
> kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:350)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  
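A minimal command-line sketch of the reproduction steps quoted above. Host names, topic name and message count are placeholders; only --new-producer and --compression-codec 2 come from the report, and the remaining producer options are elided:

{code}
# Assumes three brokers already running and ZooKeeper at zk1:2181.
bin/kafka-topics.sh --zookeeper zk1:2181 --create --topic t3 \
  --partitions 3 --replication-factor 3

# Performance producer with snappy (codec 2); a large message count leaves
# time to bounce a broker mid-run.
bin/kafka-producer-perf-test.sh --new-producer --compression-codec 2 ...

# Bounce one broker, then watch the surviving brokers' logs for
# "Error processing append operation" from ReplicaManager.
{code}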

[GitHub] kafka-site pull request: MINOR: Update JVM version and tuning sett...

2015-12-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka-site/pull/6




[GitHub] kafka-site pull request: MINOR: Update JVM version and tuning sett...

2015-12-01 Thread guozhangwang
Github user guozhangwang commented on the pull request:

https://github.com/apache/kafka-site/pull/6#issuecomment-161068767
  
LGTM, merged to asf-site.




[GitHub] kafka pull request: MINOR: Update LinkedIn JVM tuning settings

2015-12-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/607
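For context on where tuning settings like these end up: the broker start scripts read their JVM options from environment variables, so documented changes of this kind are applied as exports before starting the broker. The flags below are a generic G1 example for illustration only and are not the values from this pull request:

{code}
# Illustrative G1 settings, not the values in the pull request.
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35"
bin/kafka-server-start.sh config/server.properties
{code}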




[DISCUSS] KIP-30 Allow for brokers to have plug-able consensus and meta data storage sub systems

2015-12-01 Thread Joe Stein
I would like to start a discussion around the work that has started in
regards to KIP-30
https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems

The impetus for working on this came largely from the community. For the last
year (or more) it has been the most-asked question at any talk I have given
(personally speaking). It has also come up on the mailing list in the
discussion of zkclient vs Curator. A lot of folks want to use Kafka, but
introducing new dependencies is hard in the enterprise, so the goal here is to
make Kafka as easy as possible for operations teams to run. If they are
already supporting ZooKeeper they can keep doing that; if not, they can plug
in something else they already support to do the same things.

For the core project I think we should leave what we have upstream. This
gives a good regression baseline and makes "making what we have pluggable" a
well-defined task (carve out, layer in the API implementation, make sure the
existing tests still pass). From there, when folks want an implementation
backed by something other than ZooKeeper, they can develop, test and support
it if they choose.

We would like to suggest that the plugin interface be Java based, to minimize
dependencies for JVM implementations. This could live in another directory
(location TBD).

If you have a server you want to get working but you aren't on the JVM, don't
be afraid: think about a REST implementation, and if you can work inside of
that you have a light RPC layer (this was the first-pass prototype we did to
flesh out the public API presented in the KIP).

There are a lot of parts to working on this, and the more implementations we
have, the better we can flesh out the public interface. I will leave the
technical details and design to the JIRA tickets linked from the Confluence
page as these decisions come about and code goes up for review; we can target
the specific modules there, and keeping the context separate is helpful,
especially if multiple folks are working on it.
https://issues.apache.org/jira/browse/KAFKA-2916

Do other folks want to build implementations? Maybe we should start a
Confluence page for those, or use an existing one and add to it, so we can
coordinate there too.

Thanks!

~ Joe Stein
- - - - - - - - - - - - - - - - - - -
  http://www.elodina.net
http://www.stealth.ly
- - - - - - - - - - - - - - - - - - -


[jira] [Commented] (KAFKA-2915) System Tests that use bootstrap.servers embedded in jinja files are not working

2015-12-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034245#comment-15034245
 ] 

ASF GitHub Bot commented on KAFKA-2915:
---

GitHub user benstopford opened a pull request:

https://github.com/apache/kafka/pull/608

KAFKA-2915: Fix problem with System Tests that use bootstrap.servers 
embedded in jinja files

Fixes problems in mirror maker and consumer tests
http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/290/
http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/289/

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/benstopford/kafka KAFKA-2915-jinja-bug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/608.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #608


commit 640532b7ca10298d545e523e291e6f6fe82843c6
Author: Ben Stopford 
Date:   2015-12-01T15:27:46Z

KAFKA-2915: Added call security protocol to bootstrap servers call in jinja 
file

commit 192d96c6a53481db5b8dc428f0a2eb6d401862ea
Author: Ben Stopford 
Date:   2015-12-01T16:40:56Z

KAFKA-2915: fixed string formatting




> System Tests that use bootstrap.servers embedded in jinja files are not 
> working
> ---
>
> Key: KAFKA-2915
> URL: https://issues.apache.org/jira/browse/KAFKA-2915
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>
> Regression due to changes in the way the tests handle security. 
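For readers outside the system-test code: the jinja templates render client property files for each node, and the essence of the fix is that a rendered file must point bootstrap.servers at the port matching the security protocol in use. A hypothetical rendered result (host and port are placeholders, not taken from the templates):

{code}
# What a rendered client config needs to contain (illustrative only):
cat > consumer.properties <<'EOF'
bootstrap.servers=worker1:9093
security.protocol=SSL
EOF
{code}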





[GitHub] kafka pull request: KAFKA-2915: Fix problem with System Tests that...

2015-12-01 Thread benstopford
GitHub user benstopford opened a pull request:

https://github.com/apache/kafka/pull/608

KAFKA-2915: Fix problem with System Tests that use bootstrap.servers 
embedded in jinja files

Fixes problems in mirror maker and consumer tests
http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/290/
http://jenkins.confluent.io/job/kafka_system_tests_branch_builder/289/

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/benstopford/kafka KAFKA-2915-jinja-bug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/608.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #608


commit 640532b7ca10298d545e523e291e6f6fe82843c6
Author: Ben Stopford 
Date:   2015-12-01T15:27:46Z

KAFKA-2915: Added call security protocol to bootstrap servers call in jinja 
file

commit 192d96c6a53481db5b8dc428f0a2eb6d401862ea
Author: Ben Stopford 
Date:   2015-12-01T16:40:56Z

KAFKA-2915: fixed string formatting






[jira] [Assigned] (KAFKA-2908) Another, possibly different, Gap in Consumption after Restart

2015-12-01 Thread Jason Gustafson (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson reassigned KAFKA-2908:
--

Assignee: Jason Gustafson

> Another, possibly different, Gap in Consumption after Restart
> -
>
> Key: KAFKA-2908
> URL: https://issues.apache.org/jira/browse/KAFKA-2908
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
> Attachments: 2015-11-28--001.tar.gz
>
>
> *Context:*
> Instance of the rolling upgrade test. 10s sleeps have been put around node 
> restarts to ensure stability when subsequent nodes go down. Consumer timeout 
> has been set to 60s. Producer has been throttled to 100 messages per second. 
> Failure is rare: It occurred once in 60 executions (6 executions per run #276 
> -> #285 of the system_test_branch_builder)
> *Reported Failure:*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
> 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
> 16442, ...plus 1669 more
> *Immediate Observations:*
> * The list of messages not consumed are all in partition 1.
> * Production and Consumption continues throughout the test (there is no 
> complete write or read failure as we have seen elsewhere)
> * The messages ARE present in the data files:
> e.g. 
> {quote}
> Examining missing value 16385. This was written to p1,offset:5453
> => there is an entry in all data files for partition 1 for this (presumably 
> meaning it was replicated correctly)
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker10/kafka-data-logs/test_topic-1/.log | grep 
> 'offset: 5453'
> worker10,9,8:
> offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
> => ID 16385 is definitely present in the Kafka logs suggesting a problem 
> consumer-side
> {quote}
> These entries definitely do not appear in the consumer stdout file either. 
> *Timeline:*
> {quote}
> 15:27:02,232 - Producer sends first message
> ..servers 1, 2 are restarted (clean shutdown)
> 15:29:42,718 - Server 3 shutdown complete
> 15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
> state transition (kafka.controller.KafkaController)
> 15:29:42,743 - New leader is 2 (LeaderChangeListener)
> 15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
> with correlation id 0 epoch 7 for partition (test_topic,1) since its 
> associated leader epoch 8 is old. Current leader epoch is 8 
> (state.change.logger)
> 15:29:45,642 - Producer starts writing messages that are never consumed
> 15:30:10,804 - Last message sent by producer
> 15:31:10,983 - Consumer times out after 60s wait without messages
> {quote}
> Logs for this run are attached
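The per-replica check quoted above can be scripted. A minimal sketch, assuming the worker data directories are reachable locally as worker8/, worker9/ and worker10/; SEGMENT stands in for the log segment file name, which is elided in the report:

{code}
# Confirm the missing payload exists at the same offset on every replica.
SEGMENT=test_topic-1/00000000000000000000.log   # placeholder segment name
for w in worker8 worker9 worker10; do
  echo "== $w =="
  kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
    --files $w/kafka-data-logs/$SEGMENT | grep 'offset: 5453'
done
{code}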





[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034222#comment-15034222
 ] 

Ben Stopford commented on KAFKA-2891:
-

[~rsivaram] I found an error in my analysis of KAFKA-2909: that JIRA actually 
refers to data loss. KAFKA-2908 remains a client-side issue. This puts more 
evidence behind your theory that nodes are being killed before data is 
replicated. I'll be interested to see whether this change is stable on EC2.

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of missing messages is quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}





[jira] [Updated] (KAFKA-2909) Example of Gap in Consumption after Restart

2015-12-01 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2909:

Description: 
This seems very similar to Rajini's reported KAFKA-2891

*Context*
The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
2s sleep between restarts. Throughput was limited to 1000 messages per second. 

*Failure*
At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
64403, 64406, 64409, 36799)


Missing data was correctly written to Kafka data files:
{quote}
value 36802 -> partition 1,offset: 12216

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
12216'

->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482

In this case offset 12216 does not contain the value we expected. The 
implication is that the data was either not written or was lost during 
rebalancing. 
{quote}

The first missing value was written at: 20:42:30,185, which is around the time 
the third node goes down. 

The failed writes correlate with the consumer logging out 
NOT_COORDINATOR_FOR_GROUP and Marking the coordinator. There are many of these 
messages though over a long period so it’s hard to infer this as being the 
cause or specifically correlating with the error. 

*Timeline*
{quote}
grep -r 'shutdown complete' *
20:42:06,132 - Node 1 shutdown completed 
20:42:18,560 - Node 2 shutdown completed 
20:42:30,185 - *Writes that never make it are written by producer*
20:42:31,164 - Node 3 shutdown completed 
20:42:57,872 - Node 1 shutdown completed 
…
{quote}

Logging was verbose during this run so log files can be found here: [click 
me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]


  was:
This seems very similar to Rajini's reported KAFKA-2891

*Context*
The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
2s sleep between restarts. Throughput was limited to 1000 messages per second. 

*Failure*
At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
64403, 64406, 64409, 36799)


Missing data was correctly written to Kafka data files:
{quote}
value 36802 -> partition 1,offset: 12216

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
12216'

->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482

So in this instance the data is not there! 
{quote}

The first missing value was written at: 20:42:30,185, which is around the time 
the third node goes down. 

The failed writes correlate with the consumer logging out 
NOT_COORDINATOR_FOR_GROUP and Marking the coordinator. There are many of these 
messages though over a long period so it’s hard to infer this as being the 
cause or specifically correlating with the error. 

*Timeline*
{quote}
grep -r 'shutdown complete' *
20:42:06,132 - Node 1 shutdown completed 
20:42:18,560 - Node 2 shutdown completed 
20:42:30,185 - *Writes that never make it are written by producer*
20:42:31,164 - Node 3 shutdown completed 
20:42:57,872 - Node 1 shutdown completed 
…
{quote}

Logging was verbose during this run so log files can be found here: [click 
me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]



> Example of Gap in Consumption after Restart
> ---
>
> Key: KAFKA-2909
> URL: https://issues.apache.org/jira/browse/KAFKA-2909
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
>
> This seems very similar to Rajini's reported KAFKA-2891
> *Context*
> The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
> 2s sleep between restarts. Throughput was limited to 1000 messages per 
> second. 
> *Failure*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
> 64403, 64406, 64409, 36799)
> Missing data was correctly written to Kafka data files:
> {quote}
> value 36802 -> partition 1,offset: 12216
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
> 12216'
> ->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482
> In this case offset 12216 does not contain the value we expected. The

[jira] [Updated] (KAFKA-2908) Another, possibly different, Gap in Consumption after Restart

2015-12-01 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2908:

Description: 
*Context:*
Instance of the rolling upgrade test. 10s sleeps have been put around node 
restarts to ensure stability when subsequent nodes go down. Consumer timeout 
has been set to 60s. Producer has been throttled to 100 messages per second. 

Failure is rare: It occurred once in 60 executions (6 executions per run #276 
-> #285 of the system_test_branch_builder)

*Reported Failure:*

At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
16442, ...plus 1669 more

*Immediate Observations:*

* The list of messages not consumed are all in partition 1.
* Production and Consumption continues throughout the test (there is no 
complete write or read failure as we have seen elsewhere)
* The messages ARE present in the data files:
e.g. 
{quote}
Examining missing value 16385. This was written to p1,offset:5453
=> there is an entry in all data files for partition 1 for this (presumably 
meaning it was replicated correctly)

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker10/kafka-data-logs/test_topic-1/.log | grep 'offset: 
5453'

worker10,9,8:
offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385

=> ID 16385 is definitely present in the Kafka logs suggesting a problem 
consumer-side
{quote}

These entries definitely do not appear in the consumer stdout file either. 

*Timeline:*
{quote}
15:27:02,232 - Producer sends first message
..servers 1, 2 are restarted (clean shutdown)
15:29:42,718 - Server 3 shutdown complete
15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
state transition (kafka.controller.KafkaController)
15:29:42,743 - New leader is 2 (LeaderChangeListener)
15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
with correlation id 0 epoch 7 for partition (test_topic,1) since its associated 
leader epoch 8 is old. Current leader epoch is 8 (state.change.logger)
15:29:45,642 - Producer starts writing messages that are never consumed
15:30:10,804 - Last message sent by producer
15:31:10,983 - Consumer times out after 60s wait without messages
{quote}

Logs for this run are attached



  was:
*Context:*
Instance of the rolling upgrade test. 10s sleeps have been put around node 
restarts to ensure stability when subsequent nodes go down. Consumer timeout 
has been set to 60s. Producer has been throttled to 100 messages per second. 

Failure is rare: It occurred once in 60 executions (6 executions per run #276 
-> #285 of the system_test_branch_builder)

*Reported Failure:*

At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
16442, ...plus 1669 more

*Immediate Observations:*

* The list of messages not consumed are all in partition 1.
* Production and Consumption continues throughout the test (there is no 
complete write or read failure as we have seen elsewhere)
* The messages ARE present in the data files:
e.g. 
{quote}
Examining missing value 16385. This was written to p1,offset:5453
=> there is an entry in all data files for partition 1 for this (presumably 
meaning it was replicated correctly)

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker10/kafka-data-logs/test_topic-1/.log | grep 'offset: 
5453'

worker10,9,8:
offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
{quote}

These entries definitely do not appear in the consumer stdout file either. 

*Timeline:*
{quote}
15:27:02,232 - Producer sends first message
..servers 1, 2 are restarted (clean shutdown)
15:29:42,718 - Server 3 shutdown complete
15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
state transition (kafka.controller.KafkaController)
15:29:42,743 - New leader is 2 (LeaderChangeListener)
15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
with correlation id 0 epoch 7 for partition (test_topic,1) since its associated 
leader epoch 8 is old. Current leader epoch is 8 (state.change.logger)
15:29:45,642 - Producer starts writing messages that are never consumed
15:30:10,804 - Last message sent by producer
15:31:10,983 - Consumer times out after 60s wait without messages
{quote}

Logs for this run are attached




> Another, possibly different, Gap in Consumption after Restart
> -
>
> 

[jira] [Commented] (KAFKA-2908) Another, possibly different, Gap in Consumption after Restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034160#comment-15034160
 ] 

Ben Stopford commented on KAFKA-2908:
-

This problem is consumer-side

> Another, possibly different, Gap in Consumption after Restart
> -
>
> Key: KAFKA-2908
> URL: https://issues.apache.org/jira/browse/KAFKA-2908
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
> Attachments: 2015-11-28--001.tar.gz
>
>
> *Context:*
> Instance of the rolling upgrade test. 10s sleeps have been put around node 
> restarts to ensure stability when subsequent nodes go down. Consumer timeout 
> has been set to 60s. Producer has been throttled to 100 messages per second. 
> Failure is rare: It occurred once in 60 executions (6 executions per run #276 
> -> #285 of the system_test_branch_builder)
> *Reported Failure:*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
> 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
> 16442, ...plus 1669 more
> *Immediate Observations:*
> * The list of messages not consumed are all in partition 1.
> * Production and Consumption continues throughout the test (there is no 
> complete write or read failure as we have seen elsewhere)
> * The messages ARE present in the data files:
> e.g. 
> {quote}
> Examining missing value 16385. This was written to p1,offset:5453
> => there is an entry in all data files for partition 1 for this (presumably 
> meaning it was replicated correctly)
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker10/kafka-data-logs/test_topic-1/.log | grep 
> 'offset: 5453'
> worker10,9,8:
> offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
> {quote}
> These entries definitely do not appear in the consumer stdout file either. 
> *Timeline:*
> {quote}
> 15:27:02,232 - Producer sends first message
> ..servers 1, 2 are restarted (clean shutdown)
> 15:29:42,718 - Server 3 shutdown complete
> 15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
> state transition (kafka.controller.KafkaController)
> 15:29:42,743 - New leader is 2 (LeaderChangeListener)
> 15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
> with correlation id 0 epoch 7 for partition (test_topic,1) since its 
> associated leader epoch 8 is old. Current leader epoch is 8 
> (state.change.logger)
> 15:29:45,642 - Producer starts writing messages that are never consumed
> 15:30:10,804 - Last message sent by producer
> 15:31:10,983 - Consumer times out after 60s wait without messages
> {quote}
> Logs for this run are attached





[jira] [Commented] (KAFKA-2909) Example of Gap in Consumption after Restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034158#comment-15034158
 ] 

Ben Stopford commented on KAFKA-2909:
-

This problem is producer-side

> Example of Gap in Consumption after Restart
> ---
>
> Key: KAFKA-2909
> URL: https://issues.apache.org/jira/browse/KAFKA-2909
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
>
> This seems very similar to Rajini's reported KAFKA-2891
> *Context*
> The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
> 2s sleep between restarts. Throughput was limited to 1000 messages per 
> second. 
> *Failure*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
> 64403, 64406, 64409, 36799)
> Missing data was correctly written to Kafka data files:
> {quote}
> value 36802 -> partition 1,offset: 12216
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
> 12216'
> ->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482
> So in this instance the data is not there! 
> {quote}
> The first missing value was written at: 20:42:30,185, which is around the 
> time the third node goes down. 
> The failed writes correlate with the consumer logging out 
> NOT_COORDINATOR_FOR_GROUP and Marking the coordinator. There are many of 
> these messages though over a long period so it’s hard to infer this as being 
> the cause or specifically correlating with the error. 
> *Timeline*
> {quote}
> grep -r 'shutdown complete' *
> 20:42:06,132 - Node 1 shutdown completed 
> 20:42:18,560 - Node 2 shutdown completed 
> 20:42:30,185 - *Writes that never make it are written by producer*
> 20:42:31,164 - Node 3 shutdown completed 
> 20:42:57,872 - Node 1 shutdown completed 
> …
> {quote}
> Logging was verbose during this run so log files can be found here: [click 
> me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]





[jira] [Updated] (KAFKA-2909) Example of Gap in Consumption after Restart

2015-12-01 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2909:

Description: 
This seems very similar to Rajini's reported KAFKA-2891

*Context*
The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
2s sleep between restarts. Throughput was limited to 1000 messages per second. 

*Failure*
At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
64403, 64406, 64409, 36799)


Missing data was correctly written to Kafka data files:
{quote}
value 36802 -> partition 1,offset: 12216

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
12216'

->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482

So in this instance the data is not there! 
{quote}

The first missing value was written at: 20:42:30,185, which is around the time 
the third node goes down. 

The failed writes correlate with the consumer logging out 
NOT_COORDINATOR_FOR_GROUP and Marking the coordinator. There are many of these 
messages though over a long period so it’s hard to infer this as being the 
cause or specifically correlating with the error. 

*Timeline*
{quote}
grep -r 'shutdown complete' *
20:42:06,132 - Node 1 shutdown completed 
20:42:18,560 - Node 2 shutdown completed 
20:42:30,185 - *Writes that never make it are written by producer*
20:42:31,164 - Node 3 shutdown completed 
20:42:57,872 - Node 1 shutdown completed 
…
{quote}

Logging was verbose during this run so log files can be found here: [click 
me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]


  was:
This seems very similar to Rajini's reported KAFKA-2891

*Context*
The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
2s sleep between restarts. Throughput was limited to 1000 messages per second. 

*Failure*
At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
64403, 64406, 64409, 36799)


Missing data was correctly written to Kafka data files:
{quote}
value 36802 -> partition 1,offset: 12216

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
12216'

-> offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 3001177408

in all three data files. So the data is there. 
{quote}

The first missing value was written at: 20:42:30,185, which is around the time 
the third node goes down. 

The failed writes correlate with the consumer logging out 
NOT_COORDINATOR_FOR_GROUP and Marking the coordinator. There are many of these 
messages though over a long period so it’s hard to infer this as being the 
cause or specifically correlating with the error. 

*Timeline*
{quote}
grep -r 'shutdown complete' *
20:42:06,132 - Node 1 shutdown completed 
20:42:18,560 - Node 2 shutdown completed 
20:42:30,185 - *Writes that never make it are written by producer*
20:42:31,164 - Node 3 shutdown completed 
20:42:57,872 - Node 1 shutdown completed 
…
{quote}

Logging was verbose during this run so log files can be found here: [click 
me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]



> Example of Gap in Consumption after Restart
> ---
>
> Key: KAFKA-2909
> URL: https://issues.apache.org/jira/browse/KAFKA-2909
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
>
> This seems very similar to Rajini's reported KAFKA-2891
> *Context*
> The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
> 2s sleep between restarts. Throughput was limited to 1000 messages per 
> second. 
> *Failure*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
> 64403, 64406, 64409, 36799)
> Missing data was correctly written to Kafka data files:
> {quote}
> value 36802 -> partition 1,offset: 12216
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
> 12216'
> ->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482
> So in this instance the data is not there! 
> {quote}
> The first missing value was written at: 20:42:30,185, which is around the 
> time the third node goes down. 
> The failed writes correla

[jira] [Updated] (KAFKA-2908) Another, possibly different, Gap in Consumption after Restart

2015-12-01 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2908:

Description: 
*Context:*
Instance of the rolling upgrade test. 10s sleeps have been put around node 
restarts to ensure stability when subsequent nodes go down. Consumer timeout 
has been set to 60s. Producer has been throttled to 100 messages per second. 

Failure is rare: It occurred once in 60 executions (6 executions per run #276 
-> #285 of the system_test_branch_builder)

*Reported Failure:*

At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
16442, ...plus 1669 more

*Immediate Observations:*

* The list of messages not consumed are all in partition 1.
* Production and Consumption continues throughout the test (there is no 
complete write or read failure as we have seen elsewhere)
* The messages ARE present in the data files:
e.g. 
{quote}
Examining missing value 16385. This was written to p1,offset:5453
=> there is an entry in all data files for partition 1 for this (presumably 
meaning it was replicated correctly)

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker10/kafka-data-logs/test_topic-1/.log | grep 'offset: 
5453'

worker10,9,8:
offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
{quote}

These entries definitely do not appear in the consumer stdout file either. 

*Timeline:*
{quote}
15:27:02,232 - Producer sends first message
..servers 1, 2 are restarted (clean shutdown)
15:29:42,718 - Server 3 shutdown complete
15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
state transition (kafka.controller.KafkaController)
15:29:42,743 - New leader is 2 (LeaderChangeListener)
15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
with correlation id 0 epoch 7 for partition (test_topic,1) since its associated 
leader epoch 8 is old. Current leader epoch is 8 (state.change.logger)
15:29:45,642 - Producer starts writing messages that are never consumed
15:30:10,804 - Last message sent by producer
15:31:10,983 - Consumer times out after 60s wait without messages
{quote}

Logs for this run are attached



  was:
*Context:*
Instance of the rolling upgrade test. 10s sleeps have been put around node 
restarts to ensure stability when subsequent nodes go down. Consumer timeout 
has been set to 60s. Producer has been throttled to 100 messages per second. 

Failure is rare: It occurred once in 60 executions (6 executions per run #276 
-> #285 of the system_test_branch_builder)

*Reported Failure:*

At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
16442, ...plus 1669 more

*Immediate Observations:*

* The messages that were not consumed are all in partition 1.
* Production and consumption continue throughout the test (there is no 
complete write or read failure of the kind we have seen elsewhere)
* The messages ARE present in the data files:
e.g. 
{quote}
Examining missing value 16385. This was written to p1,offset:5453
=> there is an entry in all data files for partition 1 for this (presumably 
meaning it was replicated correctly)

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker10/kafka-data-logs/test_topic-1/.log | grep 'offset: 
5453'

worker10, 9, 8 respectively:
offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 1502953075
offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 1502953075
offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 1502953075 
{quote}

These entries definitely do not appear in the consumer stdout file either. 

*Timeline:*
{quote}
15:27:02,232 - Producer sends first message
..servers 1, 2 are restarted (clean shutdown)
15:29:42,718 - Server 3 shutdown complete
15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
state transition (kafka.controller.KafkaController)
15:29:42,743 - New leader is 2 (LeaderChangeListener)
15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
with correlation id 0 epoch 7 for partition (test_topic,1) since its associated 
leader epoch 8 is old. Current leader epoch is 8 (state.change.logger)
15:29:45,642 - Producer starts writing messages that are never consumed
15:30:10,804 - Last message sent by producer
15:31:10,983 - Consumer times out after 60s wait without messages
{quote}

Logs for this run are at

[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034103#comment-15034103
 ] 

Ben Stopford commented on KAFKA-2891:
-

[~rsivaram] The one thing we know for sure is that putting time between bounces 
solves the problem. Checking for the ISR to have two entries is a good option. 
You may even need to pad with pauses. It'd be great to get this test merged 
though, even if we have to go back to refactor it later. 
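
As a sketch of the ISR padding idea, the test could poll kafka-topics.sh 
--describe and only proceed with the next bounce once every partition of the 
topic reports at least two entries in its Isr field. The Java below is 
illustrative only and is not the actual test code; the script path, Zookeeper 
address and polling interval are assumptions:

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative sketch: poll `kafka-topics.sh --describe` until the ISR of every
// partition of the topic has at least `minIsr` entries, then allow the next bounce.
public class WaitForIsr {
    public static void waitForIsrSize(String topic, int minIsr, long timeoutMs) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (minIsrSize(topic) >= minIsr) return;
            Thread.sleep(1000);                    // polling interval is an arbitrary choice
        }
        throw new RuntimeException("Timed out waiting for ISR of " + topic + " to reach " + minIsr);
    }

    // Parses lines such as: "Topic: test_topic Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,3"
    private static int minIsrSize(String topic) throws Exception {
        Process p = new ProcessBuilder("kafka/bin/kafka-topics.sh",   // path is an assumption
                "--zookeeper", "localhost:2181", "--describe", "--topic", topic).start();
        int min = Integer.MAX_VALUE;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                int i = line.indexOf("Isr:");
                if (i >= 0) min = Math.min(min, line.substring(i + 4).trim().split(",").length);
            }
        }
        p.waitFor();
        return min == Integer.MAX_VALUE ? 0 : min;
    }
}
{code}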

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of messages missing are quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1374) LogCleaner (compaction) does not support compressed topics

2015-12-01 Thread Grant Henke (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034056#comment-15034056
 ] 

Grant Henke commented on KAFKA-1374:


With this being resolved, should we update this section of the docs? 
http://kafka.apache.org/documentation.html#design_compactionlimitations

> LogCleaner (compaction) does not support compressed topics
> --
>
> Key: KAFKA-1374
> URL: https://issues.apache.org/jira/browse/KAFKA-1374
> Project: Kafka
>  Issue Type: Bug
>Reporter: Joel Koshy
>Assignee: Manikumar Reddy
>  Labels: newbie++
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-1374.patch, KAFKA-1374.patch, 
> KAFKA-1374_2014-08-09_16:18:55.patch, KAFKA-1374_2014-08-12_22:23:06.patch, 
> KAFKA-1374_2014-09-23_21:47:12.patch, KAFKA-1374_2014-10-03_18:49:16.patch, 
> KAFKA-1374_2014-10-03_19:17:17.patch, KAFKA-1374_2015-01-18_00:19:21.patch, 
> KAFKA-1374_2015-05-18_22:55:48.patch, KAFKA-1374_2015-05-19_17:20:44.patch
>
>
> This is a known issue, but opening a ticket to track.
> If you try to compact a topic that has compressed messages you will run into
> various exceptions - typically because during iteration we advance the
> position based on the decompressed size of the message. I have a bunch of
> stack traces, but it should be straightforward to reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2922) System and Replication tools verification

2015-12-01 Thread Andrii Biletskyi (JIRA)
Andrii Biletskyi created KAFKA-2922:
---

 Summary: System and Replication tools verification
 Key: KAFKA-2922
 URL: https://issues.apache.org/jira/browse/KAFKA-2922
 Project: Kafka
  Issue Type: Sub-task
Reporter: Andrii Biletskyi


Go through System 
(https://cwiki.apache.org/confluence/display/KAFKA/System+Tools) and 
Replication tools 
(https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools) to work out 
which tools need to be abstracted away from Zookeeper as the exclusive storage 
sub-system for Kafka, updating or deprecating them as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2921) Plug-able implementations support

2015-12-01 Thread Andrii Biletskyi (JIRA)
Andrii Biletskyi created KAFKA-2921:
---

 Summary: Plug-able implementations support
 Key: KAFKA-2921
 URL: https://issues.apache.org/jira/browse/KAFKA-2921
 Project: Kafka
  Issue Type: Sub-task
Reporter: Andrii Biletskyi


Add infrastructure to support plug-able implementations at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2920) Storage module

2015-12-01 Thread Andrii Biletskyi (JIRA)
Andrii Biletskyi created KAFKA-2920:
---

 Summary: Storage module
 Key: KAFKA-2920
 URL: https://issues.apache.org/jira/browse/KAFKA-2920
 Project: Kafka
  Issue Type: Sub-task
Reporter: Andrii Biletskyi


Add Storage module and implement it on top of Zookeeper as a default version. 
Replace ZkUtils calls with calls to Storage interface.
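
The ticket does not pin down the interface itself; purely as an illustration, a 
pluggable storage abstraction along these lines could sit in front of the 
existing ZkUtils calls (the method names and signatures below are assumptions, 
not the KIP-30 API):

{code:java}
// Hypothetical sketch only: names and signatures are illustrative and are not
// taken from KIP-30 or the Kafka codebase.
public interface Storage {
    void put(String path, byte[] data);            // create or overwrite a node
    byte[] get(String path);                       // read a node, or null if absent
    void delete(String path);
    java.util.List<String> children(String path);  // list child nodes
    boolean exists(String path);
    // A ZookeeperStorage implementation would delegate each call to the
    // corresponding ZkUtils/Zookeeper operation and serve as the default.
}
{code}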



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2919) ListenerRegistry

2015-12-01 Thread Andrii Biletskyi (JIRA)
Andrii Biletskyi created KAFKA-2919:
---

 Summary: ListenerRegistry
 Key: KAFKA-2919
 URL: https://issues.apache.org/jira/browse/KAFKA-2919
 Project: Kafka
  Issue Type: Sub-task
Reporter: Andrii Biletskyi


Add ListenerRegistry interface and implement it on top of Zookeeper as a 
default version. Integrate it into Kafka instead of zookeeper data watchers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2917) LeaderElection module

2015-12-01 Thread Andrii Biletskyi (JIRA)
Andrii Biletskyi created KAFKA-2917:
---

 Summary: LeaderElection module
 Key: KAFKA-2917
 URL: https://issues.apache.org/jira/browse/KAFKA-2917
 Project: Kafka
  Issue Type: Sub-task
Reporter: Andrii Biletskyi


Add LeaderElection interface and turn current ZookeeperLeaderElector into a 
default implementation
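
Again purely as an illustration of the shape such an abstraction might take 
(these names are assumptions, not the actual KIP-30 interface):

{code:java}
// Hypothetical sketch only; not the actual interface.
public interface LeaderElection {
    interface Listener {
        void onBecomingLeader();      // this node has won the election
        void onResignation();         // this node has lost or given up leadership
    }

    void startup(Listener listener);  // register the listener and join the election
    void resign();                    // voluntarily give up leadership
    boolean amILeader();
    void close();
    // ZookeeperLeaderElector would become the default implementation.
}
{code}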



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2918) GroupMembership module

2015-12-01 Thread Andrii Biletskyi (JIRA)
Andrii Biletskyi created KAFKA-2918:
---

 Summary: GroupMembership module
 Key: KAFKA-2918
 URL: https://issues.apache.org/jira/browse/KAFKA-2918
 Project: Kafka
  Issue Type: Sub-task
Reporter: Andrii Biletskyi


Add GroupMembership interface and implement it on top of Zookeeper as a default 
version. Integrate implementation for:
a) cluster (brokers) membership 
b) consumer group (High Level Consumer) membership



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2916) KIP-30 Plug-able interface for consensus and storage sub-systems

2015-12-01 Thread Andrii Biletskyi (JIRA)
Andrii Biletskyi created KAFKA-2916:
---

 Summary: KIP-30 Plug-able interface for consensus and storage 
sub-systems
 Key: KAFKA-2916
 URL: https://issues.apache.org/jira/browse/KAFKA-2916
 Project: Kafka
  Issue Type: New Feature
  Components: core
Reporter: Andrii Biletskyi


Checklist for KIP-30 
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033850#comment-15033850
 ] 

Rajini Sivaram commented on KAFKA-2891:
---

[~benstopford] I don't see errors in my local replication test runs when run 
with PLAINTEXT with either the new consumer or the old consumer. But it could 
just be hiding timing issues because the consumer is faster. I will run the 
tests again tonight with the fix from KAFKA-2913. I am hopeful that once the 
issues you are seeing are fixed, the replication tests will just work :-)

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of messages missing are quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2915) System Tests that use bootstrap.servers embedded in jinja files are not working

2015-12-01 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2915:

Summary: System Tests that use bootstrap.servers embedded in jinja files 
are not working  (was: Tests that use bootstrap.servers embedded in jinja files 
are not working)

> System Tests that use bootstrap.servers embedded in jinja files are not 
> working
> ---
>
> Key: KAFKA-2915
> URL: https://issues.apache.org/jira/browse/KAFKA-2915
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>
> Regression due to changes in the way the tests handle security. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2915) Tests that use bootstrap.servers embedded in jinja files are not working

2015-12-01 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-2915:
---

 Summary: Tests that use bootstrap.servers embedded in jinja files 
are not working
 Key: KAFKA-2915
 URL: https://issues.apache.org/jira/browse/KAFKA-2915
 Project: Kafka
  Issue Type: Bug
Reporter: Ben Stopford


Regression due to changes in the way the tests handle security. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033724#comment-15033724
 ] 

Ben Stopford commented on KAFKA-2891:
-

[~rsivaram] so, in my investigations, even with min.insync.replicas = 2 and 
clean_shutdown, additional pauses are needed between bounces to get long-term 
stability on EC2. My theory is that this is a consumer-side problem, because I 
don't see evidence of data loss in Kafka. Maybe by waiting for the ISR to hit 2 
you are getting similar behaviour. Your test is a little more extreme though, 
due to the hard_bounce.

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of messages missing are quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2896) System test for partition re-assignment

2015-12-01 Thread jin xing (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033706#comment-15033706
 ] 

jin xing commented on KAFKA-2896:
-

Hi, [~gwenshap]
For this task, in my understanding, we need to write a test for 
ReassignPartitionsCommand.
There are three functions to test: 'verifyAssignment', 'generateAssignment' and 
'executeAssignment'.
I think we need to implement a 'MockZkUtils' for the test.
If that is appropriate, may I take this task?
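
For reference, those three functions back the --generate, --execute and 
--verify modes of the kafka-reassign-partitions.sh tool. The actual system test 
would presumably live in the ducktape framework, so the Java below is only an 
illustration of the three operations; the tool path, Zookeeper address, topic 
and plan file names are assumptions:

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative driver for the three reassignment operations; not actual test code.
public class ReassignmentSketch {
    private static final String TOOL = "kafka/bin/kafka-reassign-partitions.sh";
    private static final String ZK = "localhost:2181";

    public static void main(String[] args) throws Exception {
        Path topics = Paths.get("topics-to-move.json");
        Files.write(topics, "{\"topics\":[{\"topic\":\"test_topic\"}],\"version\":1}"
                .getBytes(StandardCharsets.UTF_8));

        // 1) generateAssignment: propose a plan for the given brokers
        run(TOOL, "--zookeeper", ZK, "--generate",
            "--topics-to-move-json-file", topics.toString(), "--broker-list", "1,2,3");

        // 2) executeAssignment: apply a plan previously saved to plan.json
        run(TOOL, "--zookeeper", ZK, "--execute", "--reassignment-json-file", "plan.json");

        // 3) verifyAssignment: check whether the plan has completed
        run(TOOL, "--zookeeper", ZK, "--verify", "--reassignment-json-file", "plan.json");
    }

    private static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        p.waitFor();
    }
}
{code}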

> System test for partition re-assignment
> ---
>
> Key: KAFKA-2896
> URL: https://issues.apache.org/jira/browse/KAFKA-2896
> Project: Kafka
>  Issue Type: Task
>Reporter: Gwen Shapira
>
> Lots of users depend on partition re-assignment tool to manage their cluster. 
> Will be nice to have a simple system tests that creates a topic with few 
> partitions and few replicas, reassigns everything and validates the ISR 
> afterwards. 
> Just to make sure we are not breaking anything. Especially since we have 
> plans to improve (read: modify) this area.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033694#comment-15033694
 ] 

Rajini Sivaram edited comment on KAFKA-2891 at 12/1/15 1:45 PM:


[~benstopford] Yes, you are right, replication test does set 
min.insync.replicas, ignore my previous comment. Have deleted the comment to 
avoid confusion.


was (Author: rsivaram):
[~benstopford] Yes, you are right, replication test does set 
min.insync.replicas, ignore my previous comment.

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of messages missing are quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033694#comment-15033694
 ] 

Rajini Sivaram commented on KAFKA-2891:
---

[~benstopford] Yes, you are right, replication test does set 
min.insync.replicas, ignore my previous comment.

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of messages missing are quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Rajini Sivaram (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-2891:
--
Comment: was deleted

(was: [~geoffra] Replication tests expect all ack'ed messages to be received 
even though it runs with the default min.insync.replicas=1. The test kills the 
leader of a partition in a loop while messages are being produced and consumed. 
This can (and does) result in the ISR dropping down to 1 (just the leader in the 
ISR list). Messages published when there are no other replicas are lost if the 
leader (the only ISR) is killed. It seems to me that the test's expectations 
are too high. When I modify the test (hard_bounce with SSL/SASL) to wait until 
there are at least two entries in the ISR list before killing the leader, it 
passes reliably in my local test runs. I wonder if the only reason this test 
has been working is because PLAINTEXT consumers keep up with the producer and 
hence are unlikely to lose messages. Would it be a reasonable change to the 
test to ensure that there are at least two ISRs before killing the leader?)

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of messages missing are quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033560#comment-15033560
 ] 

Ben Stopford edited comment on KAFKA-2891 at 12/1/15 12:18 PM:
---

[~rsivaram] That sounds reasonable to me. I'm also surprised it works reliably 
with hard bounce currently like that. Although doesn't it set the 
min.insync.replicas to 2 in the test constructor?

Note also that there are a couple of examples (in subtasks) of intermittent 
failures which look consumer-related (as the data makes it to Kafka). Jason kindly 
took a look at this yesterday with one related fix 
[KAFKA-2913|https://issues.apache.org/jira/browse/KAFKA-2913]. 





was (Author: benstopford):
[~rsivaram] That sounds reasonable to me. I'm also surprised it works reliably 
with hard bounce currently.

Note also that there are a couple of examples (in subtasks) of intermittent 
failures which look consumer-related (as the data makes it to Kafka). Jason kindly 
took a look at this yesterday with one related fix 
[KAFKA-2913|https://issues.apache.org/jira/browse/KAFKA-2913]. 




> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of messages missing are quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033560#comment-15033560
 ] 

Ben Stopford commented on KAFKA-2891:
-

[~rsivaram] That sounds reasonable to me. I'm also surprised it works reliably 
with hard bounce currently.

Note also that there are a couple of examples (in subtasks) of intermittent 
failures which look consumer-related (as the data makes it to Kafka). Jason kindly 
took a look at this yesterday with one related fix 
[KAFKA-2913|https://issues.apache.org/jira/browse/KAFKA-2913]. 




> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of messages missing are quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka-site pull request: MINOR: Update JVM version and tuning sett...

2015-12-01 Thread ijuma
GitHub user ijuma opened a pull request:

https://github.com/apache/kafka-site/pull/6

MINOR: Update JVM version and tuning settings

Changes copied from 0.9.0 branch of the kafka repo.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ijuma/kafka-site jvm-version-and-tuning

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka-site/pull/6.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6


commit 70833375c410fc8e03973ea2dca78301a3b4165c
Author: Ismael Juma 
Date:   2015-12-01T11:03:34Z

Update JVM version and tuning settings

Changes copied from 0.9.0 branch of the kafka repo.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] kafka pull request: MINOR: Update LinkedIn JVM tuning settings

2015-12-01 Thread ijuma
GitHub user ijuma opened a pull request:

https://github.com/apache/kafka/pull/607

MINOR: Update LinkedIn JVM tuning settings

This is a follow-up as the previous update was missing some changes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ijuma/kafka java-tuning-settings

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/607.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #607


commit 53ba9386fa25a54a24a45b636682b216e1f6
Author: Ismael Juma 
Date:   2015-12-01T10:40:10Z

Update LinkedIn tuning settings




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033457#comment-15033457
 ] 

Rajini Sivaram commented on KAFKA-2891:
---

[~geoffra] Replication tests expect all ack'ed messages to be received even 
though it runs with the default min.insync.replicas=1. The test kills the 
leader of a partition in a loop while messages are being produced and consumed. 
This can (and does) result in the ISR dropping down to 1 (just the leader in the 
ISR list). Messages published when there are no other replicas are lost if the 
leader (the only ISR) is killed. It seems to me that the test's expectations 
are too high. When I modify the test (hard_bounce with SSL/SASL) to wait until 
there are at least two entries in the ISR list before killing the leader, it 
passes reliably in my local test runs. I wonder if the only reason this test 
has been working is because PLAINTEXT consumers keep up with the producer and 
hence are unlikely to lose messages. Would it be a reasonable change to the 
test to ensure that there are at least two ISRs before killing the leader?
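
For context, the combination that turns this kind of silent loss into a visible 
producer error is acks=all on the producer together with min.insync.replicas=2 
on the topic: a write to a partition whose ISR has shrunk to just the leader is 
then rejected with NotEnoughReplicas instead of being acked and later lost. A 
minimal client-side sketch (broker address and topic name are placeholders):

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch only: broker address and topic name are placeholders. The topic is
// assumed to have been created with min.insync.replicas=2 (e.g. via
// kafka-topics.sh --create ... --config min.insync.replicas=2).
public class SafeProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");       // wait for the full ISR to acknowledge each write
        props.put("retries", 10);       // retry transient errors during broker bounces
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Blocking on the future surfaces NotEnoughReplicas as an exception
            // rather than letting an under-replicated write be acked and lost.
            producer.send(new ProducerRecord<>("test_topic", "key", "value")).get();
        }
    }
}
{code}

Note that min.insync.replicas is a topic-level (or broker default) setting, so 
it has to be applied where the topic is created; acks=all alone is not enough.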

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of messages missing are quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request: MINOR - fix typo in index corruption warning m...

2015-12-01 Thread lindong28
GitHub user lindong28 opened a pull request:

https://github.com/apache/kafka/pull/606

MINOR - fix typo in index corruption warning message



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lindong28/kafka minor-fix-typo

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/606.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #606


commit a7ee81430e833c783507e89e98fb5334014954f6
Author: Dong Lin 
Date:   2015-12-01T09:36:13Z

MINOR - fix typo in warning message




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-2718) Reuse of temporary directories leading to transient unit test failures

2015-12-01 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033407#comment-15033407
 ] 

Rajini Sivaram commented on KAFKA-2718:
---

[~guozhang] I tried to recreate, but I couldn't. I will submit a PR with more 
logging to see if we can get more info when the failure occurs.

> Reuse of temporary directories leading to transient unit test failures
> --
>
> Key: KAFKA-2718
> URL: https://issues.apache.org/jira/browse/KAFKA-2718
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Affects Versions: 0.9.1.0
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
> Fix For: 0.9.0.1
>
>
> Stack traces in some of the transient unit test failures indicate that 
> temporary directories used for Zookeeper are being reused.
> {quote}
> kafka.common.TopicExistsException: Topic "topic" already exists.
>   at 
> kafka.admin.AdminUtils$.createOrUpdateTopicPartitionAssignmentPathInZK(AdminUtils.scala:253)
>   at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:237)
>   at kafka.utils.TestUtils$.createTopic(TestUtils.scala:231)
>   at kafka.api.BaseConsumerTest.setUp(BaseConsumerTest.scala:63)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2914) Kafka Connect Source connector for HBase

2015-12-01 Thread Niels Basjes (JIRA)
Niels Basjes created KAFKA-2914:
---

 Summary: Kafka Connect Source connector for HBase 
 Key: KAFKA-2914
 URL: https://issues.apache.org/jira/browse/KAFKA-2914
 Project: Kafka
  Issue Type: New Feature
  Components: copycat
Reporter: Niels Basjes
Assignee: Ewen Cheslack-Postava


In many cases I see HBase being used to persist data.
I would like to listen to the changes and process them in a streaming system 
(like Apache Flink).

Feature request: A Kafka Connect "Source" that listens to the changes in a 
specified HBase table. These changes are then stored in a 'standardized' form 
in Kafka so that it becomes possible to process the observed changes in 
near-realtime. I expect this 'standard' to be very HBase specific.

Implementation suggestion: Perhaps listening to the HBase WAL like the "HBase 
Side Effects Processor" does?
https://github.com/NGDATA/hbase-indexer/tree/master/hbase-sep
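
A very rough sketch of what the Kafka Connect side could look like, with the 
WAL-listening piece stubbed out; the class name, config key and helper method 
are illustrative assumptions, not a proposed API:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Illustrative skeleton only: a Connect SourceTask that would forward HBase WAL
// edits into a Kafka topic. The WAL subscription itself is stubbed out.
public class HBaseWalSourceTask extends SourceTask {
    private String topic;

    @Override
    public String version() { return "0.0.1-sketch"; }

    @Override
    public void start(Map<String, String> props) {
        topic = props.get("topic");               // the "topic" config key is an assumption
        // here: open a WAL subscription, e.g. via the hbase-sep library
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        List<SourceRecord> records = new ArrayList<>();
        for (String edit : nextWalEdits()) {      // hypothetical helper
            records.add(new SourceRecord(
                    Collections.singletonMap("table", "demo"),  // source partition
                    Collections.singletonMap("seq", 0L),        // source offset
                    topic, Schema.STRING_SCHEMA, edit));
        }
        return records;
    }

    @Override
    public void stop() { /* close the WAL subscription */ }

    private List<String> nextWalEdits() throws InterruptedException {
        Thread.sleep(1000);                       // placeholder for blocking on the WAL
        return Collections.emptyList();
    }
}
{code}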




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Build failed in Jenkins: kafka-trunk-jdk8 #196

2015-12-01 Thread Apache Jenkins Server
See 

Changes:

[wangguoz] KAFKA-2913: missing partition check when removing groups from cache

--
[...truncated 2828 lines...]
kafka.api.ApiUtilsTest > testShortStringNonASCII PASSED

kafka.api.ApiUtilsTest > testShortStringASCII PASSED

kafka.api.PlaintextConsumerTest > testShrinkingTopicSubscriptions PASSED

kafka.api.PlaintextConsumerTest > testMultiConsumerSessionTimeoutOnStopPolling 
PASSED

kafka.api.PlaintextConsumerTest > testSeek PASSED

kafka.api.PlaintextConsumerTest > testMultiConsumerSessionTimeoutOnClose PASSED

kafka.api.PlaintextConsumerTest > testFetchRecordTooLarge PASSED

kafka.api.PlaintextConsumerTest > testMultiConsumerDefaultAssignment PASSED

kafka.api.PlaintextConsumerTest > testAutoCommitOnClose PASSED

kafka.api.PlaintextConsumerTest > testExpandingTopicSubscriptions PASSED

kafka.api.PlaintextConsumerTest > testPatternUnsubscription PASSED

kafka.api.PlaintextConsumerTest > testGroupConsumption PASSED

kafka.api.PlaintextConsumerTest > testPartitionsFor PASSED

kafka.api.PlaintextConsumerTest > testMultiConsumerRoundRobinAssignment PASSED

kafka.api.PlaintextConsumerTest > testPartitionPauseAndResume PASSED

kafka.api.PlaintextConsumerTest > testAutoCommitOnCloseAfterWakeup PASSED

kafka.api.PlaintextConsumerTest > testAutoOffsetReset PASSED

kafka.api.PlaintextConsumerTest > testFetchInvalidOffset PASSED

kafka.api.PlaintextConsumerTest > testCommitMetadata PASSED

kafka.api.PlaintextConsumerTest > testRoundRobinAssignment PASSED

kafka.api.PlaintextConsumerTest > testPatternSubscription PASSED

kafka.api.PlaintextConsumerTest > testPauseStateNotPreservedByRebalance PASSED

kafka.api.PlaintextConsumerTest > testUnsubscribeTopic PASSED

kafka.api.PlaintextConsumerTest > testListTopics PASSED

kafka.api.PlaintextConsumerTest > testAutoCommitOnRebalance PASSED

kafka.api.PlaintextConsumerTest > testSimpleConsumption PASSED

kafka.api.PlaintextConsumerTest > testPartitionReassignmentCallback PASSED

kafka.api.PlaintextConsumerTest > testCommitSpecifiedOffsets PASSED

kafka.api.ProducerBounceTest > testBrokerFailure FAILED
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
kafka.api.ProducerBounceTest.testBrokerFailure(ProducerBounceTest.scala:117)

kafka.api.ProducerFailureHandlingTest > testCannotSendToInternalTopic PASSED

kafka.api.ProducerFailureHandlingTest > testTooLargeRecordWithAckOne PASSED

kafka.api.ProducerFailureHandlingTest > testWrongBrokerList PASSED

kafka.api.ProducerFailureHandlingTest > testNotEnoughReplicas PASSED

kafka.api.ProducerFailureHandlingTest > testNonExistentTopic PASSED

kafka.api.ProducerFailureHandlingTest > testInvalidPartition PASSED

kafka.api.ProducerFailureHandlingTest > testSendAfterClosed PASSED

kafka.api.ProducerFailureHandlingTest > testTooLargeRecordWithAckZero PASSED

kafka.api.ProducerFailureHandlingTest > 
testNotEnoughReplicasAfterBrokerShutdown PASSED

kafka.api.SaslPlaintextConsumerTest > testPauseStateNotPreservedByRebalance 
PASSED

kafka.api.SaslPlaintextConsumerTest > testUnsubscribeTopic PASSED

kafka.api.SaslPlaintextConsumerTest > testListTopics PASSED

kafka.api.SaslPlaintextConsumerTest > testAutoCommitOnRebalance PASSED

kafka.api.SaslPlaintextConsumerTest > testSimpleConsumption PASSED

kafka.api.SaslPlaintextConsumerTest > testPartitionReassignmentCallback PASSED

kafka.api.SaslPlaintextConsumerTest > testCommitSpecifiedOffsets PASSED

kafka.api.SslConsumerTest > testPauseStateNotPreservedByRebalance PASSED

kafka.api.SslConsumerTest > testUnsubscribeTopic PASSED

kafka.api.SslConsumerTest > testListTopics PASSED

kafka.api.SslConsumerTest > testAutoCommitOnRebalance PASSED

kafka.api.SslConsumerTest > testSimpleConsumption PASSED

kafka.api.SslConsumerTest > testPartitionReassignmentCallback PASSED

kafka.api.SslConsumerTest > testCommitSpecifiedOffsets PASSED

kafka.api.ConsumerBounceTest > testSeekAndCommitWithBrokerFailures PASSED

kafka.api.ConsumerBounceTest > testConsumptionWithBrokerFailures PASSED

kafka.api.AuthorizerIntegrationTest > testCommitWithTopicWrite PASSED

kafka.api.AuthorizerIntegrationTest > testConsumeWithNoTopicAccess PASSED

kafka.api.AuthorizerIntegrationTest > 
testCreatePermissionNeededToReadFromNonExistentTopic PASSED

kafka.api.AuthorizerIntegrationTest > testOffsetFetchWithNoAccess PASSED

kafka.api.AuthorizerIntegrationTest > testProduceWithTopicRead PASSED

kafka.api.AuthorizerIntegrationTest > 
testCreatePermissionNeededForWritingToNonExistentTopic PASSED

kafka.api.AuthorizerIntegrationTest > testConsumeWithNoGroupAccess PASSED

kafka.api.AuthorizerIntegrationTest > testConsumeWithTopicDescribe PASSED

kafka.api.AuthorizerIntegrationTest > testOffsetFetchWithNoTopicAccess PASSED

kafka.api.AuthorizerIntegrationTest > testCommitWithNoTopic