[jira] [Created] (KAFKA-2191) Measured rate should not be infinite

2015-05-12 Thread Dong Lin (JIRA)
Dong Lin created KAFKA-2191:
---

 Summary: Measured rate should not be infinite
 Key: KAFKA-2191
 URL: https://issues.apache.org/jira/browse/KAFKA-2191
 Project: Kafka
  Issue Type: Bug
Reporter: Dong Lin
Assignee: Dong Lin


When Rate.measure() is called, it calculates the elapsed time as now - 
stat.oldest(now).lastWindowMs. But stat.oldest(now).lastWindowMs may equal now, due to 
the way SampledStat is implemented, making the elapsed time zero. As a result, 
Rate.measure() may return infinity. 

This bug needs to be fixed in order for quota implementation to work properly.
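
For illustration, a minimal standalone sketch of the failure mode (simplified from the 
description above; this is not the actual org.apache.kafka.common.metrics code):

public class InfiniteRateSketch {
    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // If the oldest sample was just (re)created, its window start can equal "now".
        long oldestLastWindowMs = now;
        double accumulatedValue = 42.0; // hypothetical recorded total

        double elapsedSeconds = (now - oldestLastWindowMs) / 1000.0; // 0.0
        double rate = accumulatedValue / elapsedSeconds;             // divide by zero
        System.out.println(rate); // prints "Infinity" (Double.POSITIVE_INFINITY)
    }
}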



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Message corruption with new Java client + snappy + broker restart

2015-05-12 Thread Jiangjie Qin
Is this related to KAFKA-2189?

Jiangjie (Becket) Qin

On 5/12/15, 9:41 PM, "Roger Hoover"  wrote:

>Hi,
>
>When using Samza 0.9.0 which uses the new Java producer client and snappy
>enabled, I see messages getting corrupted on the client side.  It never
>happens with the old producer and it never happens with lz4, gzip, or no
>compression.  It only happens when a broker gets restarted (or maybe just
>shutdown).
>
>The error is not always the same.  I've noticed at least three types of
>errors on the Kafka brokers.
>
>1) java.io.IOException: failed to read chunk
>at
>org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:35
>6)
>http://pastebin.com/NZrrEHxU
>2) java.lang.OutOfMemoryError: Java heap space
>   at
>org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:34
>6)
>http://pastebin.com/yuxk1BjY
>3) java.io.IOException: PARSING_ERROR(2)
>  at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
>http://pastebin.com/yq98Hx49
>
>I've noticed a couple different behaviors from the Samza producer/job
>A) It goes into a long retry loop where this message is logged.  I saw
>this
>with error #1 above.
>
>2015-04-29 18:17:31 Sender [WARN] task[Partition 7]
>ssp[kafka,svc.call.w_deploy.c7tH4YaiTQyBEwAAhQzRXw,7] offset[253] Got
>error produce response with correlation id 4878 on topic-partition
>svc.call.w_deploy.T2UDe2PWRYWcVAAAhMOAwA-1, retrying (2147483646 attempts
>left). Error: CORRUPT_MESSAGE
>
>B) The job exits with
>org.apache.kafka.common.errors.UnknownServerException (at least when run
>as
>ThreadJob).  I saw this with error #3 above.
>
>org.apache.samza.SamzaException: Unable to send message from
>TaskName-Partition 6 to system kafka.
>org.apache.kafka.common.errors.UnknownServerException: The server
>experienced an unexpected error when processing the request
>
>There seem to be two issues here:
>
>1) When leadership for a topic is transferred to another broker, the Java
>client (I think) has to move the data it was buffering for the original
>leader broker to the buffer for the new leader.  My guess is that the
>corruption is happening at this point.
>
>2) When a producer has corrupt message, it retries 2.1 billions times in a
>hot loop even though it's not a retriable error.  It probably shouldn't
>retry on such errors.  For retriable errors, it would be much safer to
>have
>a backoff scheme for retries.
>
>Thanks,
>
>Roger



Message corruption with new Java client + snappy + broker restart

2015-05-12 Thread Roger Hoover
Hi,

When using Samza 0.9.0, which uses the new Java producer client, with snappy
enabled, I see messages getting corrupted on the client side.  It never
happens with the old producer and it never happens with lz4, gzip, or no
compression.  It only happens when a broker gets restarted (or maybe just
shut down).

The error is not always the same.  I've noticed at least three types of
errors on the Kafka brokers.

1) java.io.IOException: failed to read chunk
at
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:356)
http://pastebin.com/NZrrEHxU
2) java.lang.OutOfMemoryError: Java heap space
   at
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:346)
http://pastebin.com/yuxk1BjY
3) java.io.IOException: PARSING_ERROR(2)
  at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
http://pastebin.com/yq98Hx49

I've noticed a couple of different behaviors from the Samza producer/job:
A) It goes into a long retry loop where this message is logged.  I saw this
with error #1 above.

2015-04-29 18:17:31 Sender [WARN] task[Partition 7]
ssp[kafka,svc.call.w_deploy.c7tH4YaiTQyBEwAAhQzRXw,7] offset[253] Got
error produce response with correlation id 4878 on topic-partition
svc.call.w_deploy.T2UDe2PWRYWcVAAAhMOAwA-1, retrying (2147483646 attempts
left). Error: CORRUPT_MESSAGE

B) The job exits with
org.apache.kafka.common.errors.UnknownServerException (at least when run as
ThreadJob).  I saw this with error #3 above.

org.apache.samza.SamzaException: Unable to send message from
TaskName-Partition 6 to system kafka.
org.apache.kafka.common.errors.UnknownServerException: The server
experienced an unexpected error when processing the request

There seem to be two issues here:

1) When leadership for a topic is transferred to another broker, the Java
client (I think) has to move the data it was buffering for the original
leader broker to the buffer for the new leader.  My guess is that the
corruption is happening at this point.

2) When a producer has a corrupt message, it retries 2.1 billion times in a
hot loop even though it's not a retriable error.  It probably shouldn't
retry on such errors.  For retriable errors, it would be much safer to have
a backoff scheme for retries (see the sketch below).

Thanks,

Roger
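
Regarding issue (2) above: a minimal sketch of how the retry behavior could be bounded
with the new Java producer's standard settings ("retries" and "retry.backoff.ms"); the
broker address and values are placeholders, not recommendations:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class BoundedRetryProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("retries", "5");            // cap retries instead of Integer.MAX_VALUE
        props.put("retry.backoff.ms", "500"); // pause between retry attempts

        KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
        // ... send records ...
        producer.close();
    }
}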


Re: Review Request 34144: Patch for KAFKA-2189

2015-05-12 Thread Ismael Juma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34144/
---

(Updated May 13, 2015, 12:59 a.m.)


Review request for kafka.


Bugs: KAFKA-2189
https://issues.apache.org/jira/browse/KAFKA-2189


Repository: kafka


Description
---

This was a regression introduced in snappy-java 1.1.1.2 and fixed
in 1.1.1.7:

https://github.com/xerial/snappy-java/commit/dc2dd27f85e5167961883f71ac2681b73b33e5df
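
As a quick way to confirm which snappy-java version actually ends up on the classpath
(this assumes snappy-java's Snappy.getNativeLibraryVersion() helper):

import org.xerial.snappy.Snappy;

public class SnappyVersionCheck {
    public static void main(String[] args) {
        // Prints the bundled snappy-java version, e.g. "1.1.1.7" after this change.
        System.out.println("snappy-java version: " + Snappy.getNativeLibraryVersion());
    }
}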


Diffs
-

  build.gradle fef515b3b2276b1f861e7cc2e33e74c3ce5e405b 

Diff: https://reviews.apache.org/r/34144/diff/


Testing (updated)
---

Tests pass when compiled with Scala 2.10.5 and 2.11.6.


Thanks,

Ismael Juma



[jira] [Commented] (KAFKA-1347) Create a system test for network partitions

2015-05-12 Thread Geoffrey Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541125#comment-14541125
 ] 

Geoffrey Anderson commented on KAFKA-1347:
--

Just making sure you're aware of work we're doing at Confluent on system tests. 
I'll be posting a KIP for this soon, but here's some info:

The original plan is sketched here:
https://cwiki.apache.org/confluence/display/KAFKA/System+Test+Improvements

This is the core library/test framework (WIP) which aids in writing and running 
the tests
https://github.com/confluentinc/ducktape/

This has system tests we've written to date for the Confluent Platform
https://github.com/confluentinc/muckrake

> Create a system test for network partitions
> ---
>
> Key: KAFKA-1347
> URL: https://issues.apache.org/jira/browse/KAFKA-1347
> Project: Kafka
>  Issue Type: Test
>Reporter: Jay Kreps
>
> We got some free and rather public QA here:
> http://aphyr.com/posts/293-call-me-maybe-kafka
> We have since added a configuration to disable unclean leader election which 
> allows you to prefer consistency over availability when all brokers fail.
> This has some unit tests, but ultimately there is no reason to believe this 
> works unless we have a fairly aggressive system test case for it.
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+System+Tests
> It would be good to add support for network partitions. I don't think we 
> actually need to try to use the jepsen stuff directly; we can just use the 
> underlying tools it uses--iptables and tc. These are Linux specific, but that 
> is probably okay. You can see these at work here:
> https://github.com/aphyr/jepsen/blob/master/src/jepsen/control/net.clj
> Having this would help provide better evidence that this works now, and would 
> keep it working in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1918) System test for ZooKeeper quorum failure scenarios

2015-05-12 Thread Geoffrey Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541127#comment-14541127
 ] 

Geoffrey Anderson commented on KAFKA-1918:
--

Just making sure you're aware of work we're doing at Confluent on system tests. 
I'll be posting a KIP for this soon, but here's some info:

The original plan is sketched here:
https://cwiki.apache.org/confluence/display/KAFKA/System+Test+Improvements

This is the core library/test framework (WIP) which aids in writing and running 
the tests
https://github.com/confluentinc/ducktape/

This has system tests we've written to date for the Confluent Platform
https://github.com/confluentinc/muckrake

> System test for ZooKeeper quorum failure scenarios
> --
>
> Key: KAFKA-1918
> URL: https://issues.apache.org/jira/browse/KAFKA-1918
> Project: Kafka
>  Issue Type: Test
>Reporter: Omid Aladini
>
> Following up on the [conversation on the mailing 
> list|http://mail-archives.apache.org/mod_mbox/kafka-users/201502.mbox/%3CCAHwHRrX3SAWDUGF5LjU4rrMUsqv%3DtJcyjX7OENeL5C_V5o3tCw%40mail.gmail.com%3E],
>  the FAQ writes:
> {quote}
> Once the Zookeeper quorum is down, brokers could end up in a bad state and 
> be unable to serve client requests normally, etc. Although when the Zookeeper quorum 
> recovers, the Kafka brokers should be able to resume to a normal state 
> automatically, _there are still a few +corner cases+ where they cannot and a 
> hard kill-and-recovery is required to bring them back to normal_. Hence it is 
> recommended to closely monitor your zookeeper cluster and provision it so 
> that it is performant.
> {quote}
> As ZK quorum failures are inevitable (due to rolling upgrades of ZK, leader 
> hardware failure, etc), it would be great to identify the corner cases (if 
> they still exist) and fix them if necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-976) Order-Preserving Mirror Maker Testcase

2015-05-12 Thread Geoffrey Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541124#comment-14541124
 ] 

Geoffrey Anderson commented on KAFKA-976:
-

Just making sure you're aware of work we're doing at Confluent on system tests. 
I'll be posting a KIP for this soon, but here's some info:

The original plan is sketched here:
https://cwiki.apache.org/confluence/display/KAFKA/System+Test+Improvements

This is the core library/test framework (WIP) which aids in writing and running 
the tests
https://github.com/confluentinc/ducktape/

This has system tests we've written to date for the Confluent Platform
https://github.com/confluentinc/muckrake

> Order-Preserving Mirror Maker Testcase
> --
>
> Key: KAFKA-976
> URL: https://issues.apache.org/jira/browse/KAFKA-976
> Project: Kafka
>  Issue Type: Test
>Reporter: Guozhang Wang
>Assignee: John Fung
> Attachments: kafka-976-v1.patch
>
>
> A new testcase (5007) for mirror_maker_testsuite is needed for the 
> key-dependent order-preserving mirror maker; this is related to KAFKA-957.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1604) System Test for Transaction Management

2015-05-12 Thread Geoffrey Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541123#comment-14541123
 ] 

Geoffrey Anderson commented on KAFKA-1604:
--

Just making sure you're aware of work we're doing at Confluent on system tests. 
I'll be posting a KIP for this soon, but here's some info:

The original plan is sketched here:
https://cwiki.apache.org/confluence/display/KAFKA/System+Test+Improvements

This is the core library/test framework (WIP) which aids in writing and running 
the tests
https://github.com/confluentinc/ducktape/

This has system tests we've written to date for the Confluent Platform
https://github.com/confluentinc/muckrake

> System Test for Transaction Management
> --
>
> Key: KAFKA-1604
> URL: https://issues.apache.org/jira/browse/KAFKA-1604
> Project: Kafka
>  Issue Type: Test
>Reporter: Dong Lin
>Assignee: Dong Lin
>  Labels: transactions
> Attachments: KAFKA-1604_2014-08-19_17:31:16.patch, 
> KAFKA-1604_2014-08-19_21:07:35.patch
>
>
> Perform end-to-end transaction management test in the following steps:
> 1) Start Zookeeper.
> 2) Start multiple brokers.
> 3) Create topic.
> 4) Start transaction-aware ProducerPerformance to generate transactional 
> messages to topic.
> 5) Start transaction-aware ConsoleConsumer to read messages from topic.
> 6) Bounce brokers (optional).
> 7) Verify that the same number of messages are sent and received.
> This patch depends on KAFKA-1524, KAFKA-1526 and KAFKA-1601.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2003) Add upgrade tests

2015-05-12 Thread Geoffrey Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541121#comment-14541121
 ] 

Geoffrey Anderson commented on KAFKA-2003:
--

Just making sure you're aware of work we're doing at Confluent on system tests. 
I'll be posting a KIP for this soon, but here's some info:

The original plan is sketched here:
https://cwiki.apache.org/confluence/display/KAFKA/System+Test+Improvements

This is the core library/test framework (WIP) which aids in writing and running 
the tests
https://github.com/confluentinc/ducktape/

This has system tests we've written to date for the Confluent Platform
https://github.com/confluentinc/muckrake

> Add upgrade tests
> -
>
> Key: KAFKA-2003
> URL: https://issues.apache.org/jira/browse/KAFKA-2003
> Project: Kafka
>  Issue Type: Improvement
>  Components: system tests
>Reporter: Gwen Shapira
>Assignee: Ashish K Singh
>
> To test protocol changes, compatibility, and the upgrade process, we need a good 
> way to test different versions of the product together and to test the end-to-end 
> upgrade process.
> For example, for 0.8.2 to 0.8.3 test we want to check:
> * Can we start a cluster with a mix of 0.8.2 and 0.8.3 brokers?
> * Can a cluster of 0.8.3 brokers bump the protocol level one broker at a time?
> * Can 0.8.2 clients run against a cluster of 0.8.3 brokers?
> There are probably more questions. But an automated framework that can test 
> those and report results will be a good start.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1904) run sanity failed test

2015-05-12 Thread Geoffrey Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541120#comment-14541120
 ] 

Geoffrey Anderson commented on KAFKA-1904:
--

Just making sure you're aware of work we're doing at Confluent on system tests. 
I'll be posting a KIP for this soon, but here's some info:

The original plan is sketched here:
https://cwiki.apache.org/confluence/display/KAFKA/System+Test+Improvements

This is the core library/test framework (WIP) which aids in writing and running 
the tests
https://github.com/confluentinc/ducktape/

This has system tests we've written to date for the Confluent Platform
https://github.com/confluentinc/muckrake

> run sanity failed test
> --
>
> Key: KAFKA-1904
> URL: https://issues.apache.org/jira/browse/KAFKA-1904
> Project: Kafka
>  Issue Type: Bug
>  Components: system tests
>Reporter: Joe Stein
>Priority: Blocker
> Fix For: 0.8.3
>
> Attachments: run_sanity.log.gz
>
>
> _test_case_name  :  testcase_1
> _test_class_name  :  ReplicaBasicTest
> arg : bounce_broker  :  true
> arg : broker_type  :  leader
> arg : message_producing_free_time_sec  :  15
> arg : num_iteration  :  2
> arg : num_messages_to_produce_per_producer_call  :  50
> arg : num_partition  :  2
> arg : replica_factor  :  3
> arg : sleep_seconds_between_producer_calls  :  1
> validation_status  : 
>  Test completed  :  FAILED



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1898) compatibility testing framework

2015-05-12 Thread Geoffrey Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541122#comment-14541122
 ] 

Geoffrey Anderson commented on KAFKA-1898:
--

Just making sure you're aware of work we're doing at Confluent on system tests. 
I'll be posting a KIP for this soon, but here's some info:

The original plan is sketched here:
https://cwiki.apache.org/confluence/display/KAFKA/System+Test+Improvements

This is the core library/test framework (WIP) which aids in writing and running 
the tests
https://github.com/confluentinc/ducktape/

This has system tests we've written to date for the Confluent Platform
https://github.com/confluentinc/muckrake

> compatibility testing framework 
> 
>
> Key: KAFKA-1898
> URL: https://issues.apache.org/jira/browse/KAFKA-1898
> Project: Kafka
>  Issue Type: Bug
>Reporter: Joe Stein
> Fix For: 0.8.3
>
> Attachments: cctk.png
>
>
> There are a few different scenarios where you want/need to know the 
> status/state of a client library that works with Kafka. Client library 
> development is not just about supporting the wire protocol but also the 
> implementations around specific interactions of the API.  The API has 
> blossomed into a robust set of producer, consumer, broker and administrative 
> calls all of which have layers of logic above them.  A Client Library may 
> choose to deviate from the path the project sets out and that is ok. The goal 
> of this ticket is to have a system for Kafka that can help to explain what 
> the library is or isn't doing (regardless of what it claims).
> The idea behind this stems from being able to quickly/easily/succinctly analyze 
> the topic message data. Once you can analyze the messages in the topic(s) you can 
> gather lots of information about what the client library is doing, is not 
> doing, and so on.  There are a few components to this.
> 1) dataset-generator 
> Test Kafka dataset generation tool. Generates a random text file with given 
> params:
> --filename, -f - output file name.
> --filesize, -s - desired size of output file. The actual size will always be 
> a bit larger (with a maximum size of $filesize + $max.length - 1)
> --min.length, -l - minimum generated entry length.
> --max.length, -h - maximum generated entry length.
> Usage:
> ./gradlew build
> java -jar dataset-generator/build/libs/dataset-generator-*.jar -s 10 -l 2 
> -h 20
> 2) dataset-producer
> Test Kafka dataset producer tool. Able to produce the given dataset to Kafka 
> or Syslog server.  The idea here is you already have lots of data sets that 
> you want to test different things for. You might have different sized 
> messages, formats, etc and want a repeatable benchmark to run and re-run the 
> testing on. You could just have a day's worth of data and choose to 
> replay it.  The CCTK idea is that you are always starting from CONSUME in 
> your library's state. If your library is only producing then you will fail a 
> bunch of tests and that might be ok for people.
> Accepts following params:
> {code}
> --filename, -f - input file name.
> --kafka, -k - Kafka broker address in host:port format. If this parameter is 
> set, --producer.config and --topic must be set too (otherwise they're 
> ignored).
> --producer.config, -p - Kafka producer properties file location.
> --topic, -t - Kafka topic to produce to.
> --syslog, -s - Syslog server address. Format: protocol://host:port 
> (tcp://0.0.0.0:5140 or udp://0.0.0.0:5141 for example)
> --loop, -l - flag to loop through file until shut off manually. False by 
> default.
> Usage:
> ./gradlew build
> java -jar dataset-producer/build/libs/dataset-producer-*.jar --filename 
> dataset --syslog tcp://0.0.0.0:5140 --loop true
> {code}
> 3) extract
> This step is good so you can save data and compare tests. It could also be 
> removed if folks are just looking for a real live test (and we could support 
> that too).  Here we are taking data out of Kafka and putting it into 
> Cassandra (but other data stores can be used too and we should come up with a 
> way to abstract this out completely so folks could implement whatever they 
> wanted.
> {code}
> package ly.stealth.shaihulud.reader
> import java.util.UUID
> import com.datastax.spark.connector._
> import com.datastax.spark.connector.cql.CassandraConnector
> import consumer.kafka.MessageAndMetadata
> import consumer.kafka.client.KafkaReceiver
> import org.apache.spark._
> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.streaming._
> import org.apache.spark.streaming.dstream.DStream
> object Main extends App with Logging {
>   val parser = new scopt.OptionParser[ReaderConfiguration]("spark-reader") {
> head("Spark Reader for Kafka client applications", "1.0")
> opt[String]("testId") unbounded() optional() action {

[jira] [Commented] (KAFKA-1888) Add a "rolling upgrade" system test

2015-05-12 Thread Geoffrey Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541118#comment-14541118
 ] 

Geoffrey Anderson commented on KAFKA-1888:
--

Just making sure you're aware of work we're doing at Confluent on system tests. 
I'll be posting a KIP for this soon, but here's some info:

The original plan is sketched here:
https://cwiki.apache.org/confluence/display/KAFKA/System+Test+Improvements

This is the core library/test framework (WIP) which aids in writing and running 
the tests
https://github.com/confluentinc/ducktape/

This has system tests we've written to date for the Confluent Platform
https://github.com/confluentinc/muckrake

> Add a "rolling upgrade" system test
> ---
>
> Key: KAFKA-1888
> URL: https://issues.apache.org/jira/browse/KAFKA-1888
> Project: Kafka
>  Issue Type: Improvement
>  Components: system tests
>Reporter: Gwen Shapira
>Assignee: Abhishek Nigam
> Fix For: 0.9.0
>
> Attachments: KAFKA-1888_2015-03-23_11:54:25.patch
>
>
> To help test upgrades and compatibility between versions, it will be cool to 
> add a rolling-upgrade test to system tests:
> Given two versions (just a path to the jars?), check that you can do a
> rolling upgrade of the brokers from one version to another (using clients 
> from the old version) without losing data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2171) System Test for Quotas

2015-05-12 Thread Geoffrey Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541116#comment-14541116
 ] 

Geoffrey Anderson commented on KAFKA-2171:
--

Just pinging this thread to make sure you're aware of work we're doing at 
Confluent on system tests. I'll be posting a KIP for this soon, but here's some 
info:

The original plan is sketched here:
https://cwiki.apache.org/confluence/display/KAFKA/System+Test+Improvements

This is the core library/test framework (WIP) which aids in writing and running 
the tests
https://github.com/confluentinc/ducktape/

This has system tests we've written to date for the Confluent Platform
https://github.com/confluentinc/muckrake

> System Test for Quotas
> --
>
> Key: KAFKA-2171
> URL: https://issues.apache.org/jira/browse/KAFKA-2171
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Dong Lin
>Assignee: Dong Lin
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Motivation and goal:
> We want to make sure that the following features work properly for both 
> consumer and producer: default quota, client-specific quota, and quota 
> sharing per clientId.
> The tests and configurations described aim to cover most of the scenarios. 
> More test cases with varying configurations (e.g. ackNum) can be added if 
> there is good reason to do so.
> Initial setup and configuration:
> In all scenarios, we first create kafka brokers and topic as follows:
> - create two kafka broker processes (by default local)
> - create a topic with replication factor = 2 and ackNum = -1
> - let max_read = max_write = 5MB. The test machine is expected to provide 
> read (write) throughput at least max_read (max_write).
> - we consider two rates to be approximately equal if they differ by at most 
> 5% (a sketch of this check appears after the scenario list below).
> Scenario 1: Validate that max_read and max_write are provided by the test 
> machine(s) using 1 producer and 1 consumer
> 1) produce data to the topic without rate limit for 30 seconds
> 2) record the rate of producer
> 3) then, consume data from the topic without rate limit until it finishes
> 4) record the rate of consumer
> 5) verify that the data consumed is identical to the data produced
> 6) verify that producer rate >= max_write and consumer rate >= max_read
> Scenario 2: validate the effectiveness of default write and read quota using 
> 1 producer and 1 consumer
> 1) configure brokers to use max_write/2 as default write quota and max_read/2 
> as default read quota
> 2) produce data to the topic for 30 seconds
> 3) record the rate of producer
> 4) then, consume data from the topic until it finishes
> 5) record the rate of consumer
> 6) verify that the data consumed is identical to the data produced
> 7) verify that recorded write (read) rate is within 5% of max_write/2 
> (max_read/2).
> Scenario 3: validate the effectiveness of client-specific write and read 
> quota using 2 producers and 2 consumers
> 1) configure brokers to use max_write/2 as default write quota and max_read/2 
> as default read quota; configure brokers to use max_write/4 for producer_2 
> and max_read/4 for consumer_2
> 2) both producers produce data to the topic for 30 seconds. They use 
> different clientId.
> 3) record the rate of producer
> 4) both consumers consume data from the topic until they finish. They use 
> different clientId and groupId.
> 5) record the rate of consumer
> 6) verify that the data consumed is identical to the data produced
> 7) verify that producer_1 and producer_2 rates are approximately max_write/2 
> and max_write/4; verify that consumer_1 and consumer_2 rates are 
> approximately max_read/2 and max_read/4.
> Scenario 4: validate the effectiveness of write and read quota sharing among 
> clients of same clientId using 2 producers and 2 consumers.
> 1) configure brokers to use max_write/2 as default write quota and max_read/2 
> as default read quota
> 2) both producers produce data to the topic for 30 seconds. They use same 
> clientId.
> 3) record the rate of producer
> 4) both consumers consume data from the topic until they finish. They use 
> same clientId but different groupId.
> 5) record the rate of consumer
> 6) verify that the data consumed is identical to the data produced
> 7) verify that total rate of producer_1 and producer_2 is approximately 
> max_write/2; verify that total rate of consumer_1 and consumer_2 is 
> approximately max_read/2.
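
A minimal sketch of the 5% tolerance check used throughout the scenarios above (class,
method, and values are illustrative, not part of any Kafka test framework):

public class RateToleranceSketch {

    // True if |observed - expected| is within 5% of expected.
    static boolean approximatelyEqual(double observed, double expected) {
        return Math.abs(observed - expected) <= 0.05 * expected;
    }

    public static void main(String[] args) {
        double maxWrite = 5.0 * 1024 * 1024;      // 5 MB/s, per the setup above
        double producerRate = 2.55 * 1024 * 1024; // hypothetical measured rate
        // Scenario 2 check: producer rate should be within 5% of max_write / 2.
        System.out.println(approximatelyEqual(producerRate, maxWrite / 2)); // true
    }
}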



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2190) Incorporate close(timeout) to Mirror Maker

2015-05-12 Thread Jiangjie Qin (JIRA)
Jiangjie Qin created KAFKA-2190:
---

 Summary: Incorporate close(timeout) to Mirror Maker
 Key: KAFKA-2190
 URL: https://issues.apache.org/jira/browse/KAFKA-2190
 Project: Kafka
  Issue Type: Bug
Reporter: Jiangjie Qin
Assignee: Jiangjie Qin


Use close(0) when the mirror maker exits unexpectedly, to avoid reordering.
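
A minimal sketch of the intended behavior, assuming the producer close(timeout) API
added by KAFKA-1660 (illustrative only, not the actual mirror maker change):

import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.producer.KafkaProducer;

public class ForceCloseSketch {
    // On an unexpected exit path, close with a zero timeout so that no
    // queued-but-unsent records are flushed after the failure point, which
    // avoids the reordering mentioned above.
    static void closeOnFatalError(KafkaProducer<byte[], byte[]> producer) {
        producer.close(0, TimeUnit.MILLISECONDS);
    }
}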



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2189) Snappy compression of message batches less efficient in 0.8.2.1

2015-05-12 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541106#comment-14541106
 ] 

Ismael Juma commented on KAFKA-2189:


Created reviewboard https://reviews.apache.org/r/34144/diff/
 against branch upstream/trunk

> Snappy compression of message batches less efficient in 0.8.2.1
> ---
>
> Key: KAFKA-2189
> URL: https://issues.apache.org/jira/browse/KAFKA-2189
> Project: Kafka
>  Issue Type: Bug
>  Components: build, compression, log
>Affects Versions: 0.8.2.1
>Reporter: Olson,Andrew
>Priority: Blocker
>  Labels: trivial
> Fix For: 0.8.3
>
> Attachments: KAFKA-2189.patch
>
>
> We are using snappy compression and noticed a fairly substantial increase 
> (about 2.25x) in log filesystem space consumption after upgrading a Kafka 
> cluster from 0.8.1.1 to 0.8.2.1. We found that this is caused by messages 
> being seemingly recompressed individually (or possibly with a much smaller 
> buffer or dictionary?) instead of as a batch as sent by producers. We 
> eventually tracked down the change in compression ratio/scope to this [1] 
> commit that updated the snappy version from 1.0.5 to 1.1.1.3. The Kafka 
> client version does not appear to be relevant as we can reproduce this with 
> both the 0.8.1.1 and 0.8.2.1 Producer.
> Here are the log files from our troubleshooting that contain the same set of 
> 1000 messages, for batch sizes of 1, 10, 100, and 1000. f9d9b was the last 
> commit with 0.8.1.1-like behavior prior to f5ab8 introducing the issue.
> {noformat}
> -rw-rw-r-- 1 kafka kafka 404967 May 12 11:45 
> /var/kafka2/f9d9b-batch-1-0/.log
> -rw-rw-r-- 1 kafka kafka 119951 May 12 11:45 
> /var/kafka2/f9d9b-batch-10-0/.log
> -rw-rw-r-- 1 kafka kafka  89645 May 12 11:45 
> /var/kafka2/f9d9b-batch-100-0/.log
> -rw-rw-r-- 1 kafka kafka  88279 May 12 11:45 
> /var/kafka2/f9d9b-batch-1000-0/.log
> -rw-rw-r-- 1 kafka kafka 402837 May 12 11:41 
> /var/kafka2/f5ab8-batch-1-0/.log
> -rw-rw-r-- 1 kafka kafka 382437 May 12 11:41 
> /var/kafka2/f5ab8-batch-10-0/.log
> -rw-rw-r-- 1 kafka kafka 364791 May 12 11:41 
> /var/kafka2/f5ab8-batch-100-0/.log
> -rw-rw-r-- 1 kafka kafka 380693 May 12 11:41 
> /var/kafka2/f5ab8-batch-1000-0/.log
> {noformat}
> [1] 
> https://github.com/apache/kafka/commit/f5ab8e1780cf80f267906e3259ad4f9278c32d28
>  
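
A rough standalone snappy-java sketch of the effect described above: compressing many
small, similar messages one at a time yields a much worse ratio than compressing them
as a single batch (the sample message and count are made up):

import java.nio.charset.StandardCharsets;
import org.xerial.snappy.Snappy;

public class BatchVsIndividualCompression {
    public static void main(String[] args) throws Exception {
        String msg = "user=alice action=click page=/home ts=1431456789 ";
        int n = 1000;

        int individualTotal = 0;
        StringBuilder batch = new StringBuilder();
        for (int i = 0; i < n; i++) {
            byte[] one = (msg + i).getBytes(StandardCharsets.UTF_8);
            individualTotal += Snappy.compress(one).length; // compress each message alone
            batch.append(msg).append(i);
        }
        byte[] wholeBatch = batch.toString().getBytes(StandardCharsets.UTF_8);
        int batchedTotal = Snappy.compress(wholeBatch).length; // compress once, as a batch

        System.out.println("sum of individually compressed sizes: " + individualTotal);
        System.out.println("size of batch compressed once:        " + batchedTotal);
    }
}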



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2189) Snappy compression of message batches less efficient in 0.8.2.1

2015-05-12 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-2189:
---
Assignee: Ismael Juma
  Status: Patch Available  (was: Open)

> Snappy compression of message batches less efficient in 0.8.2.1
> ---
>
> Key: KAFKA-2189
> URL: https://issues.apache.org/jira/browse/KAFKA-2189
> Project: Kafka
>  Issue Type: Bug
>  Components: build, compression, log
>Affects Versions: 0.8.2.1
>Reporter: Olson,Andrew
>Assignee: Ismael Juma
>Priority: Blocker
>  Labels: trivial
> Fix For: 0.8.3
>
> Attachments: KAFKA-2189.patch
>
>
> We are using snappy compression and noticed a fairly substantial increase 
> (about 2.25x) in log filesystem space consumption after upgrading a Kafka 
> cluster from 0.8.1.1 to 0.8.2.1. We found that this is caused by messages 
> being seemingly recompressed individually (or possibly with a much smaller 
> buffer or dictionary?) instead of as a batch as sent by producers. We 
> eventually tracked down the change in compression ratio/scope to this [1] 
> commit that updated the snappy version from 1.0.5 to 1.1.1.3. The Kafka 
> client version does not appear to be relevant as we can reproduce this with 
> both the 0.8.1.1 and 0.8.2.1 Producer.
> Here are the log files from our troubleshooting that contain the same set of 
> 1000 messages, for batch sizes of 1, 10, 100, and 1000. f9d9b was the last 
> commit with 0.8.1.1-like behavior prior to f5ab8 introducing the issue.
> {noformat}
> -rw-rw-r-- 1 kafka kafka 404967 May 12 11:45 
> /var/kafka2/f9d9b-batch-1-0/.log
> -rw-rw-r-- 1 kafka kafka 119951 May 12 11:45 
> /var/kafka2/f9d9b-batch-10-0/.log
> -rw-rw-r-- 1 kafka kafka  89645 May 12 11:45 
> /var/kafka2/f9d9b-batch-100-0/.log
> -rw-rw-r-- 1 kafka kafka  88279 May 12 11:45 
> /var/kafka2/f9d9b-batch-1000-0/.log
> -rw-rw-r-- 1 kafka kafka 402837 May 12 11:41 
> /var/kafka2/f5ab8-batch-1-0/.log
> -rw-rw-r-- 1 kafka kafka 382437 May 12 11:41 
> /var/kafka2/f5ab8-batch-10-0/.log
> -rw-rw-r-- 1 kafka kafka 364791 May 12 11:41 
> /var/kafka2/f5ab8-batch-100-0/.log
> -rw-rw-r-- 1 kafka kafka 380693 May 12 11:41 
> /var/kafka2/f5ab8-batch-1000-0/.log
> {noformat}
> [1] 
> https://github.com/apache/kafka/commit/f5ab8e1780cf80f267906e3259ad4f9278c32d28
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2189) Snappy compression of message batches less efficient in 0.8.2.1

2015-05-12 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-2189:
---
Attachment: KAFKA-2189.patch

> Snappy compression of message batches less efficient in 0.8.2.1
> ---
>
> Key: KAFKA-2189
> URL: https://issues.apache.org/jira/browse/KAFKA-2189
> Project: Kafka
>  Issue Type: Bug
>  Components: build, compression, log
>Affects Versions: 0.8.2.1
>Reporter: Olson,Andrew
>Priority: Blocker
>  Labels: trivial
> Fix For: 0.8.3
>
> Attachments: KAFKA-2189.patch
>
>
> We are using snappy compression and noticed a fairly substantial increase 
> (about 2.25x) in log filesystem space consumption after upgrading a Kafka 
> cluster from 0.8.1.1 to 0.8.2.1. We found that this is caused by messages 
> being seemingly recompressed individually (or possibly with a much smaller 
> buffer or dictionary?) instead of as a batch as sent by producers. We 
> eventually tracked down the change in compression ratio/scope to this [1] 
> commit that updated the snappy version from 1.0.5 to 1.1.1.3. The Kafka 
> client version does not appear to be relevant as we can reproduce this with 
> both the 0.8.1.1 and 0.8.2.1 Producer.
> Here are the log files from our troubleshooting that contain the same set of 
> 1000 messages, for batch sizes of 1, 10, 100, and 1000. f9d9b was the last 
> commit with 0.8.1.1-like behavior prior to f5ab8 introducing the issue.
> {noformat}
> -rw-rw-r-- 1 kafka kafka 404967 May 12 11:45 
> /var/kafka2/f9d9b-batch-1-0/.log
> -rw-rw-r-- 1 kafka kafka 119951 May 12 11:45 
> /var/kafka2/f9d9b-batch-10-0/.log
> -rw-rw-r-- 1 kafka kafka  89645 May 12 11:45 
> /var/kafka2/f9d9b-batch-100-0/.log
> -rw-rw-r-- 1 kafka kafka  88279 May 12 11:45 
> /var/kafka2/f9d9b-batch-1000-0/.log
> -rw-rw-r-- 1 kafka kafka 402837 May 12 11:41 
> /var/kafka2/f5ab8-batch-1-0/.log
> -rw-rw-r-- 1 kafka kafka 382437 May 12 11:41 
> /var/kafka2/f5ab8-batch-10-0/.log
> -rw-rw-r-- 1 kafka kafka 364791 May 12 11:41 
> /var/kafka2/f5ab8-batch-100-0/.log
> -rw-rw-r-- 1 kafka kafka 380693 May 12 11:41 
> /var/kafka2/f5ab8-batch-1000-0/.log
> {noformat}
> [1] 
> https://github.com/apache/kafka/commit/f5ab8e1780cf80f267906e3259ad4f9278c32d28
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Review Request 34144: Patch for KAFKA-2189

2015-05-12 Thread Ismael Juma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34144/
---

Review request for kafka.


Bugs: KAFKA-2189
https://issues.apache.org/jira/browse/KAFKA-2189


Repository: kafka


Description
---

This was a regression introduced in snappy-java 1.1.1.2 and fixed
in 1.1.1.7:

https://github.com/xerial/snappy-java/commit/dc2dd27f85e5167961883f71ac2681b73b33e5df


Diffs
-

  build.gradle fef515b3b2276b1f861e7cc2e33e74c3ce5e405b 

Diff: https://reviews.apache.org/r/34144/diff/


Testing
---


Thanks,

Ismael Juma



Re: [KIP-DISCUSSION] KIP-22 Expose a Partitioner interface in the new producer

2015-05-12 Thread Jiangjie Qin
Hi Harsha,

If you open this link

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposa
ls

All the KIPs are child pages of this page, which you can see from the
left bar. Only KIP-22 is missing. It looks like you created it as a child page
of 

https://cwiki.apache.org/confluence/display/KAFKA/Index

Thanks.

Jiangjie (Becket) Qin

On 5/12/15, 3:12 PM, "Sriharsha Chintalapani"  wrote:

>Hi Jiangjie,
>   Its under 
>https://cwiki.apache.org/confluence/display/KAFKA/KIP-+22++Expose+a+Partit
>ioner+interface+in+the+new+producer
>I checked other KIPS they are under /KAFKA  as well.
>
>Thanks,
>Harsha
>On May 12, 2015 at 2:12:30 PM, Jiangjie Qin (j...@linkedin.com.invalid)
>wrote:
>
>Hey Harsha,  
>
>It looks you created the KIP page at wrong place. . . Can you move the
>page to a child page of
>https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Propos
>a  
>ls  
>
>Thanks.  
>
>Jiangjie (Becket) Qin
>
>On 5/6/15, 6:12 PM, "Harsha"  wrote:
>
>>Thanks for the review Joel. I agree don't need a init method we can use
>>configure. I'll update the KIP.
>>-Harsha  
>>  
>>On Wed, May 6, 2015, at 04:45 PM, Joel Koshy wrote:
>>> +1 with a minor comment: do we need an init method given it extends
>>> Configurable?  
>>>  
>>> Also, can you move this wiki out of drafts and add it to the table in
>>>  
>>>https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Prop
>>>o  
>>>sals?  
>>>  
>>> Thanks,  
>>>  
>>> Joel  
>>>  
>>> On Wed, May 06, 2015 at 07:46:46AM -0700, Sriharsha Chintalapani
>>>wrote:  
>>> > Thanks Jay. I removed partitioner.metadata from KIP. I'll send an
>>>updated patch.  
>>> >  
>>> > --  
>>> > Harsha  
>>> > Sent with Airmail
>>> >  
>>> > On May 5, 2015 at 6:31:47 AM, Sriharsha Chintalapani
>>>(harsh...@fastmail.fm) wrote:
>>> >  
>>> > Thanks for the comments everyone.
>>> > Hi Jay,  
>>> > I do have a question regarding configurable interface on how to
>>>pass a Map<String, ?> properties. I couldn't find any other classes
>>>using it. JMX reporter overrides it but doesn't implement it. So with
>>>configurable partitioner how can a user pass in partitioner
>>>configuration since its getting instantiated within the producer.
>>> >  
>>> > Thanks,  
>>> > Harsha  
>>> >  
>>> >  
>>> > On May 4, 2015 at 10:36:45 AM, Jay Kreps (jay.kr...@gmail.com)
>>>wrote:  
>>> >  
>>> > Hey Harsha,  
>>> >  
>>> > That proposal sounds good. One minor thing--I don't think we need to
>>> 
>>>have  
>>> > the partitioner.metadata property. Our reason for using string
>>>properties  
>>> > is exactly to make config extensible at runtime. So a given
>>>partitioner can 
>>> > add whatever properties make sense using the configure() api it
>>>defines.  
>>> >  
>>> > -Jay  
>>> >  
>>> > On Sun, May 3, 2015 at 5:57 PM, Harsha  wrote:
>>> >  
>>> > > Thanks Jay & Gianmarco for the comments. I picked the option A, if
>>> 
>>>user  
>>> > > sends a partition id than it will applied and partitioner.class
>>>method  
>>> > > will only called if partition id is null .
>>> > > Please take a look at the updated KIP here
>>> > >  
>>> > >  
>>>https://cwiki.apache.org/confluence/display/KAFKA/KIP-+22+-+Expose+a+Par
>>>t  
>>>itioner+interface+in+the+new+producer
>>> > > . Let me know if you see anything missing.
>>> > >  
>>> > > Thanks,  
>>> > > Harsha  
>>> > >  
>>> > > On Fri, Apr 24, 2015, at 02:15 AM, Gianmarco De Francisci Morales
>>>wrote:  
>>> > > > Hi,  
>>> > > >  
>>> > > >  
>>> > > > Here are the questions I think we should consider:
>>> > > > > 1. Do we need this at all given that we have the partition
>>>argument in  
>>> > > > > ProducerRecord which gives full control? I think we do need it
>>> 
>>>because  
>>> > > this  
>>> > > > > is a way to plug in a different partitioning strategy at run
>>>time and  
>>> > > do it  
>>> > > > > in a fairly transparent way.
>>> > > > >  
>>> > > >  
>>> > > > Yes, we need it if we want to support different partitioning
>>>strategies  
>>> > > > inside Kafka rather than requiring the user to code them
>>>externally.  
>>> > > >  
>>> > > >  
>>> > > > > 3. Do we need to add the value? I suspect people will have
>>>uses  
>>>for  
>>> > > > > computing something off a few fields in the value to choose
>>>the  
>>> > > partition. 
>>> > > > > This would be useful in cases where the key was being used for
>>> 
>>>log  
>>> > > > > compaction purposes and did not contain the full information
>>>for  
>>> > > computing  
>>> > > > > the partition.
>>> > > > >  
>>> > > >  
>>> > > > I am not entirely sure about this. I guess that most
>>>partitioners  
>>>should  
>>> > > > not use it.
>>> > > > I think it makes it easier to reason about the system if the
>>>partitioner  
>>> > > > only works on the key.
>>> > > > Hoever, if the value (and its serialization) are already
>>>available, there
>>> > > > is not much harm in passing them along.
>>> > > >  
>>> > > >  
>>> > > > > 4. This interface doesn't include either an init() or clo

Re: [KIP-DISCUSSION] KIP-22 Expose a Partitioner interface in the new producer

2015-05-12 Thread Jiangjie Qin
Hey Harsha,

The link in your last email gives a "page not found" error. . .

Jiangjie (Becket) Qin

On 5/12/15, 3:12 PM, "Sriharsha Chintalapani"  wrote:

>Hi Jiangjie,
>   Its under 
>https://cwiki.apache.org/confluence/display/KAFKA/KIP-+22++Expose+a+Partit
>ioner+interface+in+the+new+producer
>I checked other KIPS they are under /KAFKA  as well.
>
>Thanks,
>Harsha
>On May 12, 2015 at 2:12:30 PM, Jiangjie Qin (j...@linkedin.com.invalid)
>wrote:
>
>Hey Harsha,  
>
>It looks you created the KIP page at wrong place. . . Can you move the
>page to a child page of
>https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Propos
>a  
>ls  
>
>Thanks.  
>
>Jiangjie (Becket) Qin
>
>On 5/6/15, 6:12 PM, "Harsha"  wrote:
>
>>Thanks for the review Joel. I agree don't need a init method we can use
>>configure. I'll update the KIP.
>>-Harsha  
>>  
>>On Wed, May 6, 2015, at 04:45 PM, Joel Koshy wrote:
>>> +1 with a minor comment: do we need an init method given it extends
>>> Configurable?  
>>>  
>>> Also, can you move this wiki out of drafts and add it to the table in
>>>  
>>>https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Prop
>>>o  
>>>sals?  
>>>  
>>> Thanks,  
>>>  
>>> Joel  
>>>  
>>> On Wed, May 06, 2015 at 07:46:46AM -0700, Sriharsha Chintalapani
>>>wrote:  
>>> > Thanks Jay. I removed partitioner.metadata from KIP. I'll send an
>>>updated patch.  
>>> >  
>>> > --  
>>> > Harsha  
>>> > Sent with Airmail
>>> >  
>>> > On May 5, 2015 at 6:31:47 AM, Sriharsha Chintalapani
>>>(harsh...@fastmail.fm) wrote:
>>> >  
>>> > Thanks for the comments everyone.
>>> > Hi Jay,  
>>> > I do have a question regarding configurable interface on how to
>>>pass a Map<String, ?> properties. I couldn't find any other classes
>>>using it. JMX reporter overrides it but doesn't implement it. So with
>>>configurable partitioner how can a user pass in partitioner
>>>configuration since its getting instantiated within the producer.
>>> >  
>>> > Thanks,  
>>> > Harsha  
>>> >  
>>> >  
>>> > On May 4, 2015 at 10:36:45 AM, Jay Kreps (jay.kr...@gmail.com)
>>>wrote:  
>>> >  
>>> > Hey Harsha,  
>>> >  
>>> > That proposal sounds good. One minor thing--I don't think we need to
>>> 
>>>have  
>>> > the partitioner.metadata property. Our reason for using string
>>>properties  
>>> > is exactly to make config extensible at runtime. So a given
>>>partitioner can 
>>> > add whatever properties make sense using the configure() api it
>>>defines.  
>>> >  
>>> > -Jay  
>>> >  
>>> > On Sun, May 3, 2015 at 5:57 PM, Harsha  wrote:
>>> >  
>>> > > Thanks Jay & Gianmarco for the comments. I picked the option A, if
>>> 
>>>user  
>>> > > sends a partition id than it will applied and partitioner.class
>>>method  
>>> > > will only called if partition id is null .
>>> > > Please take a look at the updated KIP here
>>> > >  
>>> > >  
>>>https://cwiki.apache.org/confluence/display/KAFKA/KIP-+22+-+Expose+a+Par
>>>t  
>>>itioner+interface+in+the+new+producer
>>> > > . Let me know if you see anything missing.
>>> > >  
>>> > > Thanks,  
>>> > > Harsha  
>>> > >  
>>> > > On Fri, Apr 24, 2015, at 02:15 AM, Gianmarco De Francisci Morales
>>>wrote:  
>>> > > > Hi,  
>>> > > >  
>>> > > >  
>>> > > > Here are the questions I think we should consider:
>>> > > > > 1. Do we need this at all given that we have the partition
>>>argument in  
>>> > > > > ProducerRecord which gives full control? I think we do need it
>>> 
>>>because  
>>> > > this  
>>> > > > > is a way to plug in a different partitioning strategy at run
>>>time and  
>>> > > do it  
>>> > > > > in a fairly transparent way.
>>> > > > >  
>>> > > >  
>>> > > > Yes, we need it if we want to support different partitioning
>>>strategies  
>>> > > > inside Kafka rather than requiring the user to code them
>>>externally.  
>>> > > >  
>>> > > >  
>>> > > > > 3. Do we need to add the value? I suspect people will have
>>>uses  
>>>for  
>>> > > > > computing something off a few fields in the value to choose
>>>the  
>>> > > partition. 
>>> > > > > This would be useful in cases where the key was being used for
>>> 
>>>log  
>>> > > > > compaction purposes and did not contain the full information
>>>for  
>>> > > computing  
>>> > > > > the partition.
>>> > > > >  
>>> > > >  
>>> > > > I am not entirely sure about this. I guess that most
>>>partitioners  
>>>should  
>>> > > > not use it.
>>> > > > I think it makes it easier to reason about the system if the
>>>partitioner  
>>> > > > only works on the key.
>>> > > > Hoever, if the value (and its serialization) are already
>>>available, there
>>> > > > is not much harm in passing them along.
>>> > > >  
>>> > > >  
>>> > > > > 4. This interface doesn't include either an init() or close()
>>>method.  
>>> > > It  
>>> > > > > should implement Closable and Configurable, right?
>>> > > > >  
>>> > > >  
>>> > > > Right now the only application I can think of to have an init()
>>>and  
>>> > > > close()  
>>> > > > is to read some state informati
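
To make the discussion above concrete, here is a rough sketch of what a configurable
partitioner could look like under the proposal. The interface name, method signature,
and config key are assumptions drawn from the thread, not the final KIP-22 API:

import java.util.Arrays;
import java.util.Map;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.Configurable;

// Hypothetical sketch only; the real KIP-22 interface may differ.
interface SketchPartitioner extends Configurable {
    int partition(String topic, byte[] keyBytes, byte[] valueBytes, Cluster cluster);
}

class HashRangePartitionerSketch implements SketchPartitioner {
    private int range = 1; // illustrative custom setting

    @Override
    public void configure(Map<String, ?> configs) {
        // Custom settings arrive through the producer's string properties, as
        // discussed above; "partitioner.hash.range" is a made-up key.
        Object value = configs.get("partitioner.hash.range");
        if (value != null) {
            range = Math.max(1, Integer.parseInt(value.toString()));
        }
    }

    @Override
    public int partition(String topic, byte[] keyBytes, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        int hash = keyBytes == null ? 0 : Arrays.hashCode(keyBytes);
        // Normalize so the result is always in [0, numPartitions).
        return (((hash / range) % numPartitions) + numPartitions) % numPartitions;
    }
}

Whether the value bytes belong in the signature at all is exactly the open question in
point 3 of the thread.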

Add missing API to old high level consumer

2015-05-12 Thread Jiangjie Qin
Hi,

I just noticed that in KAFKA-1650 (which predates the KIP process) we added an 
offset commit method to the high level consumer that commits offsets using a 
user-provided offset map.

public void commitOffsets(Map<TopicAndPartition, OffsetAndMetadata> 
offsetsToCommit, boolean retryOnFailure);

This method was added to all the Scala classes but I forgot to add it to the Java 
API of ConsumerConnector. (Already regretting it now. . .)
This method is very useful in several cases and has been asked for from time to 
time. For example, people have several threads consuming messages and 
processing them. Without this method, one thread will unexpectedly commit 
offsets for another thread, and thus might lose some messages if something goes 
wrong.
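
A rough sketch of the per-thread commit described above, assuming the proposed
overload is added to the Java ConsumerConnector (how the offset map is built is
left to the caller, since the OffsetAndMetadata constructor differs across versions):

import java.util.Map;
import kafka.common.OffsetAndMetadata;
import kafka.common.TopicAndPartition;
import kafka.javaapi.consumer.ConsumerConnector;

public class PerThreadCommitSketch {
    static void commitOwnOffsets(ConsumerConnector connector,
                                 Map<TopicAndPartition, OffsetAndMetadata> processedByThisThread) {
        // Commit only the offsets this thread has fully processed, instead of a
        // connector-wide commit that would also cover messages other threads have
        // fetched but not yet processed.
        connector.commitOffsets(processedByThisThread, true); // retryOnFailure = true
    }
}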

I created KAFKA-2186 and hope we can add this missing method to the Java API 
of the old high level consumer (literally a one-line change).
Although this method should have been there since KAFKA-1650, adding this 
method to the Java API now is a public API change, so I just want to see if 
people think we need a KIP for this.

Thanks.

Jiangjie (Becket) Qin


Build failed in Jenkins: KafkaPreCommit #100

2015-05-12 Thread Apache Jenkins Server
See 

Changes:

[jjkoshy] KAFKA-2175; Change INFO to DEBUG for socket close on no error; and 
partition reassignment JSON; reviewed by Gwen Shapira and Joel Koshy

[jjkoshy] KAFKA-2163; The offsets manager's stale-offset-cleanup and offset 
load should be mutually exclusive; reviewed by Jun Rao

[jjkoshy] KAFKA-1660; Add API to the producer to support close with a timeout; 
reviewed by Joel Koshy and Jay Kreps.

--
[...truncated 531 lines...]
org.apache.kafka.common.record.RecordTest > testEquality[45] PASSED

org.apache.kafka.common.record.RecordTest > testFields[46] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[46] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[46] PASSED

org.apache.kafka.common.record.RecordTest > testFields[47] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[47] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[47] PASSED

org.apache.kafka.common.record.RecordTest > testFields[48] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[48] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[48] PASSED

org.apache.kafka.common.record.RecordTest > testFields[49] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[49] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[49] PASSED

org.apache.kafka.common.record.RecordTest > testFields[50] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[50] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[50] PASSED

org.apache.kafka.common.record.RecordTest > testFields[51] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[51] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[51] PASSED

org.apache.kafka.common.record.RecordTest > testFields[52] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[52] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[52] PASSED

org.apache.kafka.common.record.RecordTest > testFields[53] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[53] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[53] PASSED

org.apache.kafka.common.record.RecordTest > testFields[54] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[54] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[54] PASSED

org.apache.kafka.common.record.RecordTest > testFields[55] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[55] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[55] PASSED

org.apache.kafka.common.record.RecordTest > testFields[56] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[56] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[56] PASSED

org.apache.kafka.common.record.RecordTest > testFields[57] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[57] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[57] PASSED

org.apache.kafka.common.record.RecordTest > testFields[58] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[58] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[58] PASSED

org.apache.kafka.common.record.RecordTest > testFields[59] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[59] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[59] PASSED

org.apache.kafka.common.record.RecordTest > testFields[60] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[60] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[60] PASSED

org.apache.kafka.common.record.RecordTest > testFields[61] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[61] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[61] PASSED

org.apache.kafka.common.record.RecordTest > testFields[62] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[62] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[62] PASSED

org.apache.kafka.common.record.RecordTest > testFields[63] PASSED

org.apache.kafka.common.record.RecordTest > testChecksum[63] PASSED

org.apache.kafka.common.record.RecordTest > testEquality[63] PASSED
:contrib:compileJava UP-TO-DATE
:contrib:processResources UP-TO-DATE
:contrib:classes UP-TO-DATE
:contrib:compileTestJava UP-TO-DATE
:contrib:processTestResources UP-TO-DATE
:contrib:testClasses UP-TO-DATE
:contrib:test UP-TO-DATE
:clients:jar
:core:compileJava UP-TO-DATE
:core:compileScala
Download 
http://repo1.maven.org/maven2/com/typesafe/zinc/zinc/0.3.7/zinc-0.3.7.pom
Download 
http://repo1.maven.org/maven2/com/typesafe/sbt/incremental-compiler/0.13.7/incremental-compiler-0.13.7.pom
Download 
http://repo1.maven.org/maven2/com/typesafe/sbt/compiler-interface/0.13.7/compiler-interface-0.13.7.pom
Download 
http://repo1.maven.org/maven2/com/typesafe/sbt/sbt-interface/0.13.7/sbt-interfac

[jira] [Updated] (KAFKA-1690) new java producer needs ssl support as a client

2015-05-12 Thread Sriharsha Chintalapani (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriharsha Chintalapani updated KAFKA-1690:
--
Attachment: KAFKA-1690_2015-05-12_16:20:08.patch

> new java producer needs ssl support as a client
> ---
>
> Key: KAFKA-1690
> URL: https://issues.apache.org/jira/browse/KAFKA-1690
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Joe Stein
>Assignee: Sriharsha Chintalapani
> Fix For: 0.8.3
>
> Attachments: KAFKA-1690.patch, KAFKA-1690.patch, 
> KAFKA-1690_2015-05-10_23:20:30.patch, KAFKA-1690_2015-05-10_23:31:42.patch, 
> KAFKA-1690_2015-05-11_16:09:36.patch, KAFKA-1690_2015-05-12_16:20:08.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1690) new java producer needs ssl support as a client

2015-05-12 Thread Sriharsha Chintalapani (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540989#comment-14540989
 ] 

Sriharsha Chintalapani commented on KAFKA-1690:
---

Updated reviewboard https://reviews.apache.org/r/33620/diff/
 against branch origin/trunk

> new java producer needs ssl support as a client
> ---
>
> Key: KAFKA-1690
> URL: https://issues.apache.org/jira/browse/KAFKA-1690
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Joe Stein
>Assignee: Sriharsha Chintalapani
> Fix For: 0.8.3
>
> Attachments: KAFKA-1690.patch, KAFKA-1690.patch, 
> KAFKA-1690_2015-05-10_23:20:30.patch, KAFKA-1690_2015-05-10_23:31:42.patch, 
> KAFKA-1690_2015-05-11_16:09:36.patch, KAFKA-1690_2015-05-12_16:20:08.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 33620: Patch for KAFKA-1690

2015-05-12 Thread Sriharsha Chintalapani

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33620/
---

(Updated May 12, 2015, 11:20 p.m.)


Review request for kafka.


Bugs: KAFKA-1690
https://issues.apache.org/jira/browse/KAFKA-1690


Repository: kafka


Description (updated)
---

KAFKA-1690. new java producer needs ssl support as a client.


KAFKA-1690. new java producer needs ssl support as a client.


KAFKA-1690. new java producer needs ssl support as a client.


KAFKA-1690. new java producer needs ssl support as a client. SSLFactory tests.


KAFKA-1690. new java producer needs ssl support as a client. Added 
PrincipalBuilder.


Diffs (updated)
-

  build.gradle fef515b3b2276b1f861e7cc2e33e74c3ce5e405b 
  checkstyle/checkstyle.xml a215ff36e9252879f1e0be5a86fef9a875bb8f38 
  checkstyle/import-control.xml f2e6cec267e67ce8e261341e373718e14a8e8e03 
  clients/src/main/java/org/apache/kafka/clients/ClientUtils.java 
0d68bf1e1e90fe9d5d4397ddf817b9a9af8d9f7a 
  clients/src/main/java/org/apache/kafka/clients/CommonClientConfigs.java 
cf32e4e7c40738fe6d8adc36ae0cfad459ac5b0b 
  clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java 
bdff518b732105823058e6182f445248b45dc388 
  clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java 
d301be4709f7b112e1f3a39f3c04cfa65f00fa60 
  clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java 
8e336a3aa96c73f52beaeb56b931baf4b026cf21 
  clients/src/main/java/org/apache/kafka/clients/producer/ProducerConfig.java 
187d0004c8c46b6664ddaffecc6166d4b47351e5 
  clients/src/main/java/org/apache/kafka/common/config/AbstractConfig.java 
c4fa058692f50abb4f47bd344119d805c60123f5 
  clients/src/main/java/org/apache/kafka/common/network/Authenticator.java 
PRE-CREATION 
  clients/src/main/java/org/apache/kafka/common/network/Channel.java 
PRE-CREATION 
  
clients/src/main/java/org/apache/kafka/common/network/DefaultAuthenticator.java 
PRE-CREATION 
  
clients/src/main/java/org/apache/kafka/common/network/PlainTextTransportLayer.java
 PRE-CREATION 
  clients/src/main/java/org/apache/kafka/common/network/SSLFactory.java 
PRE-CREATION 
  clients/src/main/java/org/apache/kafka/common/network/SSLTransportLayer.java 
PRE-CREATION 
  clients/src/main/java/org/apache/kafka/common/network/Selectable.java 
b5f8d83e89f9026dc0853e5f92c00b2d7f043e22 
  clients/src/main/java/org/apache/kafka/common/network/Selector.java 
57de0585e5e9a53eb9dcd99cac1ab3eb2086a302 
  clients/src/main/java/org/apache/kafka/common/network/TransportLayer.java 
PRE-CREATION 
  clients/src/main/java/org/apache/kafka/common/protocol/SecurityProtocol.java 
dab1a94dd29563688b6ecf4eeb0e180b06049d3f 
  
clients/src/main/java/org/apache/kafka/common/security/auth/DefaultPrincipalBuilder.java
 PRE-CREATION 
  
clients/src/main/java/org/apache/kafka/common/security/auth/KafkaPrincipal.java 
PRE-CREATION 
  
clients/src/main/java/org/apache/kafka/common/security/auth/PrincipalBuilder.java
 PRE-CREATION 
  clients/src/main/java/org/apache/kafka/common/utils/Utils.java 
f73eedb030987f018d8446bb1dcd98d19fa97331 
  clients/src/test/java/org/apache/kafka/common/network/EchoServer.java 
PRE-CREATION 
  clients/src/test/java/org/apache/kafka/common/network/SSLFactoryTest.java 
PRE-CREATION 
  clients/src/test/java/org/apache/kafka/common/network/SSLSelectorTest.java 
PRE-CREATION 
  clients/src/test/java/org/apache/kafka/common/network/SelectorTest.java 
d5b306b026e788b4e5479f3419805aa49ae889f3 
  clients/src/test/java/org/apache/kafka/common/utils/UtilsTest.java 
2ebe3c21f611dc133a2dbb8c7dfb0845f8c21498 
  clients/src/test/java/org/apache/kafka/test/TestSSLUtils.java PRE-CREATION 

Diff: https://reviews.apache.org/r/33620/diff/


Testing
---


Thanks,

Sriharsha Chintalapani
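
For a sense of what this buys client users, here is a rough sketch of how a producer might opt into SSL once the patch lands. The configuration keys below are assumptions based on the classes in the diff (SSLFactory, SecurityProtocol) and may well change before the final API is settled.

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class SslProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093"); // an SSL listener; hypothetical host
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        // Assumed SSL-related keys; the final names depend on how the patch is merged.
        props.put("security.protocol", "SSL");
        props.put("ssl.keystore.location", "/path/to/client.keystore.jks");
        props.put("ssl.keystore.password", "changeit");
        props.put("ssl.truststore.location", "/path/to/client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // send records as usual; the transport layer takes care of the TLS handshake
        }
    }
}
{code}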



[jira] [Updated] (KAFKA-2189) Snappy compression of message batches less efficient in 0.8.2.1

2015-05-12 Thread Jun Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao updated KAFKA-2189:
---
 Priority: Blocker  (was: Major)
Fix Version/s: 0.8.3
 Assignee: (was: Jay Kreps)
   Labels: trivial  (was: )

> Snappy compression of message batches less efficient in 0.8.2.1
> ---
>
> Key: KAFKA-2189
> URL: https://issues.apache.org/jira/browse/KAFKA-2189
> Project: Kafka
>  Issue Type: Bug
>  Components: build, compression, log
>Affects Versions: 0.8.2.1
>Reporter: Olson,Andrew
>Priority: Blocker
>  Labels: trivial
> Fix For: 0.8.3
>
>
> We are using snappy compression and noticed a fairly substantial increase 
> (about 2.25x) in log filesystem space consumption after upgrading a Kafka 
> cluster from 0.8.1.1 to 0.8.2.1. We found that this is caused by messages 
> being seemingly recompressed individually (or possibly with a much smaller 
> buffer or dictionary?) instead of as a batch as sent by producers. We 
> eventually tracked down the change in compression ratio/scope to this [1] 
> commit that updated the snappy version from 1.0.5 to 1.1.1.3. The Kafka 
> client version does not appear to be relevant as we can reproduce this with 
> both the 0.8.1.1 and 0.8.2.1 Producer.
> Here are the log files from our troubleshooting that contain the same set of 
> 1000 messages, for batch sizes of 1, 10, 100, and 1000. f9d9b was the last 
> commit with 0.8.1.1-like behavior prior to f5ab8 introducing the issue.
> {noformat}
> -rw-rw-r-- 1 kafka kafka 404967 May 12 11:45 
> /var/kafka2/f9d9b-batch-1-0/.log
> -rw-rw-r-- 1 kafka kafka 119951 May 12 11:45 
> /var/kafka2/f9d9b-batch-10-0/.log
> -rw-rw-r-- 1 kafka kafka  89645 May 12 11:45 
> /var/kafka2/f9d9b-batch-100-0/.log
> -rw-rw-r-- 1 kafka kafka  88279 May 12 11:45 
> /var/kafka2/f9d9b-batch-1000-0/.log
> -rw-rw-r-- 1 kafka kafka 402837 May 12 11:41 
> /var/kafka2/f5ab8-batch-1-0/.log
> -rw-rw-r-- 1 kafka kafka 382437 May 12 11:41 
> /var/kafka2/f5ab8-batch-10-0/.log
> -rw-rw-r-- 1 kafka kafka 364791 May 12 11:41 
> /var/kafka2/f5ab8-batch-100-0/.log
> -rw-rw-r-- 1 kafka kafka 380693 May 12 11:41 
> /var/kafka2/f5ab8-batch-1000-0/.log
> {noformat}
> [1] 
> https://github.com/apache/kafka/commit/f5ab8e1780cf80f267906e3259ad4f9278c32d28
>  
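
The ratio difference described above is easy to see with snappy-java on its own, outside Kafka: compressing many small messages independently costs far more than compressing them as one batch. The sketch below does not reproduce the broker recompression path itself; the payload and counts are made up and only the relative sizes matter.

{code:java}
import java.io.ByteArrayOutputStream;
import org.xerial.snappy.Snappy;

public class SnappyBatchVsSingle {
    public static void main(String[] args) throws Exception {
        byte[] message = "some fairly repetitive log line payload ".getBytes("UTF-8");
        int n = 1000;

        int perMessageTotal = 0;
        ByteArrayOutputStream batch = new ByteArrayOutputStream();
        for (int i = 0; i < n; i++) {
            perMessageTotal += Snappy.compress(message).length; // one snappy block per message
            batch.write(message);                               // accumulate for the batch case
        }
        int batchCompressed = Snappy.compress(batch.toByteArray()).length;

        System.out.println("per-message total: " + perMessageTotal + " bytes");
        System.out.println("whole batch:       " + batchCompressed + " bytes");
    }
}
{code}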



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2175) Reduce server log verbosity at info level

2015-05-12 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-2175:
--
Resolution: Pending Closed
Status: Resolved  (was: Patch Available)

+1

Thanks for the patch - committed to trunk.

> Reduce server log verbosity at info level
> -
>
> Key: KAFKA-2175
> URL: https://issues.apache.org/jira/browse/KAFKA-2175
> Project: Kafka
>  Issue Type: Improvement
>  Components: controller, zkclient
>Affects Versions: 0.8.3
>Reporter: Todd Palino
>Assignee: Todd Palino
>Priority: Minor
>  Labels: newbie
> Attachments: KAFKA-2175.patch
>
>
> Currently, the broker logs two messages at INFO level that should be at a 
> lower level. This serves only to fill up log files on disk, and can cause 
> performance issues due to synchronous logging as well.
> The first is the "Closing socket connection" message when there is no error. 
> This should be reduced to debug level. The second is the message that ZkUtil 
> writes when updating the partition reassignment JSON. This message contains 
> the entire JSON blob and should never be written at info level. In addition, 
> there is already a message in the controller log stating that the ZK node has 
> been updated.
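
A minimal sketch of the kind of demotion being asked for is below. The real change is in the broker's Scala code; the class, message text, and condition here are illustrative only.

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ConnectionCloser {
    private static final Logger log = LoggerFactory.getLogger(ConnectionCloser.class);

    void close(String remoteAddress, boolean dueToError) {
        if (dueToError) {
            log.info("Closing socket connection to {} due to an error.", remoteAddress);
        } else {
            // Routine closes are high volume; keep them out of INFO-level logs.
            log.debug("Closing socket connection to {}.", remoteAddress);
        }
    }
}
{code}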



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540921#comment-14540921
 ] 

Jun Rao commented on KAFKA-2147:


Looking a bit more at the code, it seems that this could be caused by a bug on our 
side. We didn't synchronize when getting the watcher size, so the cleaner thread 
may see an outdated value, which can delay the cleaning. I am attaching a patch 
against the 0.8.2 branch. Could you try it out and see if it addresses your issue?
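
For context, the shape of the race and of the fix being described is roughly the following. This is a minimal Java-flavoured sketch; the actual patch touches core/src/main/scala/kafka/server/RequestPurgatory.scala and the names here are illustrative.

{code:java}
import java.util.LinkedList;
import java.util.List;

// A watcher list whose size is read by a separate cleaner thread. Without
// synchronization the cleaner may observe a stale size and postpone purging;
// taking the same lock for both the mutation and the size read avoids that.
class Watchers<T> {
    private final List<T> requests = new LinkedList<>();

    public synchronized void watch(T request) {
        requests.add(request);
    }

    // The fix: read the size under the same lock so the cleaner thread
    // sees an up-to-date value.
    public synchronized int watched() {
        return requests.size();
    }
}
{code}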

> Unbalanced replication can cause extreme purgatory growth
> -
>
> Key: KAFKA-2147
> URL: https://issues.apache.org/jira/browse/KAFKA-2147
> Project: Kafka
>  Issue Type: Bug
>  Components: purgatory, replication
>Affects Versions: 0.8.2.1
>Reporter: Evan Huus
>Assignee: Jun Rao
> Attachments: KAFKA-2147.patch, watch-lists.log
>
>
> Apologies in advance, this is going to be a bit of complex description, 
> mainly because we've seen this issue several different ways and we're still 
> tying them together in terms of root cause and analysis.
> It is worth noting now that we have all our producers set up to send 
> RequiredAcks==-1, and that this includes all our MirrorMakers.
> I understand the purgatory is being rewritten (again) for 0.8.3. Hopefully 
> that will incidentally fix this issue, or at least render it moot.
> h4. Symptoms
> Fetch request purgatory on a broker or brokers grows rapidly and steadily at 
> a rate of roughly 1-5K requests per second. Heap memory used also grows to 
> keep pace. When 4-5 million requests have accumulated in purgatory, the 
> purgatory is drained, causing a substantial latency spike. The node will tend 
> to drop leadership, replicate, and recover.
> h5. Case 1 - MirrorMaker
> We first noticed this case when enabling mirrormaker. We had one primary 
> cluster already, with many producers and consumers. We created a second, 
> identical cluster and enabled replication from the original to the new 
> cluster on some topics using mirrormaker. This caused all six nodes in the 
> new cluster to exhibit the symptom in lockstep - their purgatories would all 
> grow together, and get drained within about 20 seconds of each other. The 
> cluster-wide latency spikes at this time caused several problems for us.
> Turning MM on and off turned the problem on and off very precisely. When we 
> stopped MM, the purgatories would all drop to normal levels immediately, and 
> would start climbing again when we restarted it.
> Note that this is the *fetch* purgatories on the brokers that MM was 
> *producing* to, which indicates fairly strongly that this is a replication 
> issue, not a MM issue.
> This particular cluster and MM setup was abandoned for other reasons before 
> we could make much progress debugging.
> h5. Case 2 - Broker 6
> The second time we saw this issue was on the newest broker (broker 6) in the 
> original cluster. For a long time we were running with five nodes, and 
> eventually added a sixth to handle the increased load. At first, we moved 
> only a handful of higher-volume partitions to this broker. Later, we created 
> a group of new topics (totalling around 100 partitions) for testing purposes 
> that were spread automatically across all six nodes. These topics saw 
> occasional traffic, but were generally unused. At this point broker 6 had 
> leadership for about an equal number of high-volume and unused partitions, 
> about 15-20 of each.
> Around this time (we don't have detailed enough data to prove real 
> correlation unfortunately), the issue started appearing on this broker as 
> well, but not on any of the other brokers in the cluster.
> h4. Debugging
> The first thing we tried was to reduce the 
> `fetch.purgatory.purge.interval.requests` from the default of 1000 to a much 
> lower value of 200. This had no noticeable effect at all.
> We then enabled debug logging on broker06 and started looking through that. I 
> can attach complete log samples if necessary, but the thing that stood out 
> for us was a substantial number of the following lines:
> {noformat}
> [2015-04-23 20:05:15,196] DEBUG [KafkaApi-6] Putting fetch request with 
> correlation id 49939 from client ReplicaFetcherThread-0-6 into purgatory 
> (kafka.server.KafkaApis)
> {noformat}
> The volume of these lines seemed to match (approximately) the fetch purgatory 
> growth on that broker.
> At this point we developed a hypothesis (detailed below) which guided our 
> subsequent debugging tests:
> - Setting a daemon up to produce regular random data to all of the topics led 
> by kafka06 (specifically the ones which otherwise would receive no data) 
> substantially alleviated the problem.
> - Doing an additional rebalance of the cluster in order to move a numb

[jira] [Updated] (KAFKA-1660) Ability to call close() with a timeout on the Java Kafka Producer.

2015-05-12 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-1660:
--
Resolution: Pending Closed
Status: Resolved  (was: Patch Available)

Thanks for the patch - committed to trunk.

> Ability to call close() with a timeout on the Java Kafka Producer. 
> ---
>
> Key: KAFKA-1660
> URL: https://issues.apache.org/jira/browse/KAFKA-1660
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, producer 
>Affects Versions: 0.8.2.0
>Reporter: Andrew Stein
>Assignee: Jiangjie Qin
> Fix For: 0.8.3
>
> Attachments: KAFKA-1660.patch, KAFKA-1660.patch, 
> KAFKA-1660_2015-02-17_16:41:19.patch, KAFKA-1660_2015-03-02_10:41:49.patch, 
> KAFKA-1660_2015-03-08_21:14:50.patch, KAFKA-1660_2015-03-09_12:56:39.patch, 
> KAFKA-1660_2015-03-25_10:55:42.patch, KAFKA-1660_2015-03-27_16:35:42.patch, 
> KAFKA-1660_2015-04-07_18:18:40.patch, KAFKA-1660_2015-04-08_14:01:12.patch, 
> KAFKA-1660_2015-04-10_15:08:54.patch, KAFKA-1660_2015-04-16_11:35:37.patch, 
> KAFKA-1660_2015-04-20_17:38:22.patch, KAFKA-1660_2015-04-29_16:58:43.patch, 
> KAFKA-1660_2015-04-29_17:37:51.patch, KAFKA-1660_2015-05-12_14:28:53.patch
>
>
> I would like the ability to call {{close}} with a timeout on the Java 
> Client's KafkaProducer.
> h6. Workaround
> Currently, it is possible to ensure that {{close}} will return quickly by 
> first doing a {{future.get(timeout)}} on the last future produced on each 
> partition, but this means that the user has to define the partitions up front 
> at the time of {{send}} and track the returned {{future}}'s
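
The workaround described above can be sketched roughly as follows. Tracking the last future per partition at send() time is left out, and this is not the proposed close(timeout) API itself, just the bounded-wait idea.

{code:java}
import java.util.Map;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.RecordMetadata;

class BoundedCloseExample {
    // lastFuturePerPartition is assumed to have been filled in at send() time.
    static void waitBrieflyThenClose(KafkaProducer<byte[], byte[]> producer,
                                     Map<Integer, Future<RecordMetadata>> lastFuturePerPartition,
                                     long timeoutMs) throws Exception {
        for (Future<RecordMetadata> last : lastFuturePerPartition.values()) {
            try {
                last.get(timeoutMs, TimeUnit.MILLISECONDS); // bound the wait per partition
            } catch (TimeoutException e) {
                // give up waiting on this partition; close() below may still block on it
            }
        }
        producer.close();
    }
}
{code}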



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2163) Offsets manager cache should prevent stale-offset-cleanup while an offset load is in progress; otherwise we can lose consumer offsets

2015-05-12 Thread Joel Koshy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Koshy updated KAFKA-2163:
--
Resolution: Pending Closed
Status: Resolved  (was: Patch Available)

Committed to trunk.

> Offsets manager cache should prevent stale-offset-cleanup while an offset 
> load is in progress; otherwise we can lose consumer offsets
> -
>
> Key: KAFKA-2163
> URL: https://issues.apache.org/jira/browse/KAFKA-2163
> Project: Kafka
>  Issue Type: Bug
>Reporter: Joel Koshy
>Assignee: Joel Koshy
> Fix For: 0.8.3
>
> Attachments: KAFKA-2163.patch
>
>
> When leadership of an offsets partition moves, the new leader loads offsets 
> from that partition into the offset manager cache.
> Independently, the offset manager has a periodic cleanup task for stale 
> offsets that removes old offsets from the cache and appends tombstones for 
> those. If the partition happens to contain much older offsets (earlier in the 
> log) and inserts those into the cache; the cleanup task may run and see those 
> offsets (which it deems to be stale) and proceeds to remove from the cache 
> and append a tombstone to the end of the log. The tombstone will override the 
> true latest offset and a subsequent offset fetch request will return no 
> offset.
> We just need to prevent the cleanup task from running during an offset load.
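
One way to picture "prevent the cleanup task from running during an offset load" is a shared lock between the two paths, as in the sketch below. The names are illustrative; the real change is in the Scala offset manager.

{code:java}
import java.util.concurrent.locks.ReentrantLock;

class OffsetCacheGuard {
    private final ReentrantLock loadOrCleanLock = new ReentrantLock();

    void loadOffsetsFromPartition(Runnable loadBody) {
        loadOrCleanLock.lock();
        try {
            loadBody.run(); // insert offsets read from the log into the cache
        } finally {
            loadOrCleanLock.unlock();
        }
    }

    void maybeCleanStaleOffsets(Runnable cleanupBody) {
        if (loadOrCleanLock.tryLock()) { // skip this pass entirely if a load is in progress
            try {
                cleanupBody.run(); // remove stale entries and append tombstones
            } finally {
                loadOrCleanLock.unlock();
            }
        }
    }
}
{code}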



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Jun Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao updated KAFKA-2147:
---
Assignee: Jun Rao  (was: Joel Koshy)
  Status: Patch Available  (was: Open)

> Unbalanced replication can cause extreme purgatory growth
> -
>
> Key: KAFKA-2147
> URL: https://issues.apache.org/jira/browse/KAFKA-2147
> Project: Kafka
>  Issue Type: Bug
>  Components: purgatory, replication
>Affects Versions: 0.8.2.1
>Reporter: Evan Huus
>Assignee: Jun Rao
> Attachments: KAFKA-2147.patch, watch-lists.log
>
>
> Apologies in advance, this is going to be a bit of complex description, 
> mainly because we've seen this issue several different ways and we're still 
> tying them together in terms of root cause and analysis.
> It is worth noting now that we have all our producers set up to send 
> RequiredAcks==-1, and that this includes all our MirrorMakers.
> I understand the purgatory is being rewritten (again) for 0.8.3. Hopefully 
> that will incidentally fix this issue, or at least render it moot.
> h4. Symptoms
> Fetch request purgatory on a broker or brokers grows rapidly and steadily at 
> a rate of roughly 1-5K requests per second. Heap memory used also grows to 
> keep pace. When 4-5 million requests have accumulated in purgatory, the 
> purgatory is drained, causing a substantial latency spike. The node will tend 
> to drop leadership, replicate, and recover.
> h5. Case 1 - MirrorMaker
> We first noticed this case when enabling mirrormaker. We had one primary 
> cluster already, with many producers and consumers. We created a second, 
> identical cluster and enabled replication from the original to the new 
> cluster on some topics using mirrormaker. This caused all six nodes in the 
> new cluster to exhibit the symptom in lockstep - their purgatories would all 
> grow together, and get drained within about 20 seconds of each other. The 
> cluster-wide latency spikes at this time caused several problems for us.
> Turning MM on and off turned the problem on and off very precisely. When we 
> stopped MM, the purgatories would all drop to normal levels immediately, and 
> would start climbing again when we restarted it.
> Note that this is the *fetch* purgatories on the brokers that MM was 
> *producing* to, which indicates fairly strongly that this is a replication 
> issue, not a MM issue.
> This particular cluster and MM setup was abandoned for other reasons before 
> we could make much progress debugging.
> h5. Case 2 - Broker 6
> The second time we saw this issue was on the newest broker (broker 6) in the 
> original cluster. For a long time we were running with five nodes, and 
> eventually added a sixth to handle the increased load. At first, we moved 
> only a handful of higher-volume partitions to this broker. Later, we created 
> a group of new topics (totalling around 100 partitions) for testing purposes 
> that were spread automatically across all six nodes. These topics saw 
> occasional traffic, but were generally unused. At this point broker 6 had 
> leadership for about an equal number of high-volume and unused partitions, 
> about 15-20 of each.
> Around this time (we don't have detailed enough data to prove real 
> correlation unfortunately), the issue started appearing on this broker as 
> well, but not on any of the other brokers in the cluster.
> h4. Debugging
> The first thing we tried was to reduce the 
> `fetch.purgatory.purge.interval.requests` from the default of 1000 to a much 
> lower value of 200. This had no noticeable effect at all.
> We then enabled debug logging on broker06 and started looking through that. I 
> can attach complete log samples if necessary, but the thing that stood out 
> for us was a substantial number of the following lines:
> {noformat}
> [2015-04-23 20:05:15,196] DEBUG [KafkaApi-6] Putting fetch request with 
> correlation id 49939 from client ReplicaFetcherThread-0-6 into purgatory 
> (kafka.server.KafkaApis)
> {noformat}
> The volume of these lines seemed to match (approximately) the fetch purgatory 
> growth on that broker.
> At this point we developed a hypothesis (detailed below) which guided our 
> subsequent debugging tests:
> - Setting a daemon up to produce regular random data to all of the topics led 
> by kafka06 (specifically the ones which otherwise would receive no data) 
> substantially alleviated the problem.
> - Doing an additional rebalance of the cluster in order to move a number of 
> other topics with regular data to kafka06 appears to have solved the problem 
> completely.
> h4. Hypothesis
> Current versions (0.8.2.1 and earlier) have issues with the replica fetcher 
> not backing off correctly (KAFKA-1461, KAFKA-2082 and others). I believe that 
> in a very specific situation, the replica fetc

[jira] [Updated] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Jun Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao updated KAFKA-2147:
---
Attachment: KAFKA-2147.patch

> Unbalanced replication can cause extreme purgatory growth
> -
>
> Key: KAFKA-2147
> URL: https://issues.apache.org/jira/browse/KAFKA-2147
> Project: Kafka
>  Issue Type: Bug
>  Components: purgatory, replication
>Affects Versions: 0.8.2.1
>Reporter: Evan Huus
>Assignee: Joel Koshy
> Attachments: KAFKA-2147.patch, watch-lists.log
>
>
> Apologies in advance, this is going to be a bit of complex description, 
> mainly because we've seen this issue several different ways and we're still 
> tying them together in terms of root cause and analysis.
> It is worth noting now that we have all our producers set up to send 
> RequiredAcks==-1, and that this includes all our MirrorMakers.
> I understand the purgatory is being rewritten (again) for 0.8.3. Hopefully 
> that will incidentally fix this issue, or at least render it moot.
> h4. Symptoms
> Fetch request purgatory on a broker or brokers grows rapidly and steadily at 
> a rate of roughly 1-5K requests per second. Heap memory used also grows to 
> keep pace. When 4-5 million requests have accumulated in purgatory, the 
> purgatory is drained, causing a substantial latency spike. The node will tend 
> to drop leadership, replicate, and recover.
> h5. Case 1 - MirrorMaker
> We first noticed this case when enabling mirrormaker. We had one primary 
> cluster already, with many producers and consumers. We created a second, 
> identical cluster and enabled replication from the original to the new 
> cluster on some topics using mirrormaker. This caused all six nodes in the 
> new cluster to exhibit the symptom in lockstep - their purgatories would all 
> grow together, and get drained within about 20 seconds of each other. The 
> cluster-wide latency spikes at this time caused several problems for us.
> Turning MM on and off turned the problem on and off very precisely. When we 
> stopped MM, the purgatories would all drop to normal levels immediately, and 
> would start climbing again when we restarted it.
> Note that this is the *fetch* purgatories on the brokers that MM was 
> *producing* to, which indicates fairly strongly that this is a replication 
> issue, not a MM issue.
> This particular cluster and MM setup was abandoned for other reasons before 
> we could make much progress debugging.
> h5. Case 2 - Broker 6
> The second time we saw this issue was on the newest broker (broker 6) in the 
> original cluster. For a long time we were running with five nodes, and 
> eventually added a sixth to handle the increased load. At first, we moved 
> only a handful of higher-volume partitions to this broker. Later, we created 
> a group of new topics (totalling around 100 partitions) for testing purposes 
> that were spread automatically across all six nodes. These topics saw 
> occasional traffic, but were generally unused. At this point broker 6 had 
> leadership for about an equal number of high-volume and unused partitions, 
> about 15-20 of each.
> Around this time (we don't have detailed enough data to prove real 
> correlation unfortunately), the issue started appearing on this broker as 
> well, but not on any of the other brokers in the cluster.
> h4. Debugging
> The first thing we tried was to reduce the 
> `fetch.purgatory.purge.interval.requests` from the default of 1000 to a much 
> lower value of 200. This had no noticeable effect at all.
> We then enabled debug logging on broker06 and started looking through that. I 
> can attach complete log samples if necessary, but the thing that stood out 
> for us was a substantial number of the following lines:
> {noformat}
> [2015-04-23 20:05:15,196] DEBUG [KafkaApi-6] Putting fetch request with 
> correlation id 49939 from client ReplicaFetcherThread-0-6 into purgatory 
> (kafka.server.KafkaApis)
> {noformat}
> The volume of these lines seemed to match (approximately) the fetch purgatory 
> growth on that broker.
> At this point we developed a hypothesis (detailed below) which guided our 
> subsequent debugging tests:
> - Setting a daemon up to produce regular random data to all of the topics led 
> by kafka06 (specifically the ones which otherwise would receive no data) 
> substantially alleviated the problem.
> - Doing an additional rebalance of the cluster in order to move a number of 
> other topics with regular data to kafka06 appears to have solved the problem 
> completely.
> h4. Hypothesis
> Current versions (0.8.2.1 and earlier) have issues with the replica fetcher 
> not backing off correctly (KAFKA-1461, KAFKA-2082 and others). I believe that 
> in a very specific situation, the replica fetcher thread of one broker can 
> spam another bro

[jira] [Commented] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540917#comment-14540917
 ] 

Jun Rao commented on KAFKA-2147:


Created reviewboard https://reviews.apache.org/r/34125/diff/
 against branch origin/0.8.2

> Unbalanced replication can cause extreme purgatory growth
> -
>
> Key: KAFKA-2147
> URL: https://issues.apache.org/jira/browse/KAFKA-2147
> Project: Kafka
>  Issue Type: Bug
>  Components: purgatory, replication
>Affects Versions: 0.8.2.1
>Reporter: Evan Huus
>Assignee: Joel Koshy
> Attachments: KAFKA-2147.patch, watch-lists.log
>
>
> Apologies in advance, this is going to be a bit of complex description, 
> mainly because we've seen this issue several different ways and we're still 
> tying them together in terms of root cause and analysis.
> It is worth noting now that we have all our producers set up to send 
> RequiredAcks==-1, and that this includes all our MirrorMakers.
> I understand the purgatory is being rewritten (again) for 0.8.3. Hopefully 
> that will incidentally fix this issue, or at least render it moot.
> h4. Symptoms
> Fetch request purgatory on a broker or brokers grows rapidly and steadily at 
> a rate of roughly 1-5K requests per second. Heap memory used also grows to 
> keep pace. When 4-5 million requests have accumulated in purgatory, the 
> purgatory is drained, causing a substantial latency spike. The node will tend 
> to drop leadership, replicate, and recover.
> h5. Case 1 - MirrorMaker
> We first noticed this case when enabling mirrormaker. We had one primary 
> cluster already, with many producers and consumers. We created a second, 
> identical cluster and enabled replication from the original to the new 
> cluster on some topics using mirrormaker. This caused all six nodes in the 
> new cluster to exhibit the symptom in lockstep - their purgatories would all 
> grow together, and get drained within about 20 seconds of each other. The 
> cluster-wide latency spikes at this time caused several problems for us.
> Turning MM on and off turned the problem on and off very precisely. When we 
> stopped MM, the purgatories would all drop to normal levels immediately, and 
> would start climbing again when we restarted it.
> Note that this is the *fetch* purgatories on the brokers that MM was 
> *producing* to, which indicates fairly strongly that this is a replication 
> issue, not a MM issue.
> This particular cluster and MM setup was abandoned for other reasons before 
> we could make much progress debugging.
> h5. Case 2 - Broker 6
> The second time we saw this issue was on the newest broker (broker 6) in the 
> original cluster. For a long time we were running with five nodes, and 
> eventually added a sixth to handle the increased load. At first, we moved 
> only a handful of higher-volume partitions to this broker. Later, we created 
> a group of new topics (totalling around 100 partitions) for testing purposes 
> that were spread automatically across all six nodes. These topics saw 
> occasional traffic, but were generally unused. At this point broker 6 had 
> leadership for about an equal number of high-volume and unused partitions, 
> about 15-20 of each.
> Around this time (we don't have detailed enough data to prove real 
> correlation unfortunately), the issue started appearing on this broker as 
> well, but not on any of the other brokers in the cluster.
> h4. Debugging
> The first thing we tried was to reduce the 
> `fetch.purgatory.purge.interval.requests` from the default of 1000 to a much 
> lower value of 200. This had no noticeable effect at all.
> We then enabled debug logging on broker06 and started looking through that. I 
> can attach complete log samples if necessary, but the thing that stood out 
> for us was a substantial number of the following lines:
> {noformat}
> [2015-04-23 20:05:15,196] DEBUG [KafkaApi-6] Putting fetch request with 
> correlation id 49939 from client ReplicaFetcherThread-0-6 into purgatory 
> (kafka.server.KafkaApis)
> {noformat}
> The volume of these lines seemed to match (approximately) the fetch purgatory 
> growth on that broker.
> At this point we developed a hypothesis (detailed below) which guided our 
> subsequent debugging tests:
> - Setting a daemon up to produce regular random data to all of the topics led 
> by kafka06 (specifically the ones which otherwise would receive no data) 
> substantially alleviated the problem.
> - Doing an additional rebalance of the cluster in order to move a number of 
> other topics with regular data to kafka06 appears to have solved the problem 
> completely.
> h4. Hypothesis
> Current versions (0.8.2.1 and earlier) have issues with the replica fetcher 
> not backing off correctly (KAFKA-1461, KAFKA-2082 and others

Review Request 34125: Patch for KAFKA-2147

2015-05-12 Thread Jun Rao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34125/
---

Review request for kafka.


Bugs: KAFKA-2147
https://issues.apache.org/jira/browse/KAFKA-2147


Repository: kafka


Description
---

synchronize on getting size from watchers


Diffs
-

  core/src/main/scala/kafka/server/RequestPurgatory.scala 
87ee3bec9c57d7322f1b5d8caa74b1c8415bcf49 

Diff: https://reviews.apache.org/r/34125/diff/


Testing
---


Thanks,

Jun Rao



Re: [KIP-DISCUSSION] KIP-22 Expose a Partitioner interface in the new producer

2015-05-12 Thread Sriharsha Chintalapani
Hi Jiangjie,
       It's under 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-+22++Expose+a+Partitioner+interface+in+the+new+producer
I checked the other KIPs; they are under /KAFKA as well.

Thanks,
Harsha
On May 12, 2015 at 2:12:30 PM, Jiangjie Qin (j...@linkedin.com.invalid) wrote:

Hey Harsha,  

It looks like you created the KIP page in the wrong place. Can you move the  
page to a child page of  
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposa  
ls  

Thanks.  

Jiangjie (Becket) Qin  

On 5/6/15, 6:12 PM, "Harsha"  wrote:  

>Thanks for the review Joel. I agree don't need a init method we can use  
>configure. I'll update the KIP.  
>-Harsha  
>  
>On Wed, May 6, 2015, at 04:45 PM, Joel Koshy wrote:  
>> +1 with a minor comment: do we need an init method given it extends  
>> Configurable?  
>>  
>> Also, can you move this wiki out of drafts and add it to the table in  
>>  
>>https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Propo  
>>sals?  
>>  
>> Thanks,  
>>  
>> Joel  
>>  
>> On Wed, May 06, 2015 at 07:46:46AM -0700, Sriharsha Chintalapani wrote:  
>> > Thanks Jay. I removed partitioner.metadata from KIP. I'll send an  
>>updated patch.  
>> >  
>> > --  
>> > Harsha  
>> > Sent with Airmail  
>> >  
>> > On May 5, 2015 at 6:31:47 AM, Sriharsha Chintalapani  
>>(harsh...@fastmail.fm) wrote:  
>> >  
>> > Thanks for the comments everyone.  
>> > Hi Jay,  
>> > I do have a question regarding configurable interface on how to  
>>pass a Map properties. I couldn't find any other classes  
>>using it. JMX reporter overrides it but doesn't implement it. So with  
>>a configurable partitioner how can a user pass in partitioner  
>>configuration since it's getting instantiated within the producer.  
>> >  
>> > Thanks,  
>> > Harsha  
>> >  
>> >  
>> > On May 4, 2015 at 10:36:45 AM, Jay Kreps (jay.kr...@gmail.com) wrote:  
>> >  
>> > Hey Harsha,  
>> >  
>> > That proposal sounds good. One minor thing--I don't think we need to  
>>have  
>> > the partitioner.metadata property. Our reason for using string  
>>properties  
>> > is exactly to make config extensible at runtime. So a given  
>>partitioner can  
>> > add whatever properties make sense using the configure() api it  
>>defines.  
>> >  
>> > -Jay  
>> >  
>> > On Sun, May 3, 2015 at 5:57 PM, Harsha  wrote:  
>> >  
>> > > Thanks Jay & Gianmarco for the comments. I picked the option A, if  
>>user  
>> > > sends a partition id than it will applied and partitioner.class  
>>method  
>> > > will only called if partition id is null .  
>> > > Please take a look at the updated KIP here  
>> > >  
>> > >  
>>https://cwiki.apache.org/confluence/display/KAFKA/KIP-+22+-+Expose+a+Part  
>>itioner+interface+in+the+new+producer  
>> > > . Let me know if you see anything missing.  
>> > >  
>> > > Thanks,  
>> > > Harsha  
>> > >  
>> > > On Fri, Apr 24, 2015, at 02:15 AM, Gianmarco De Francisci Morales  
>>wrote:  
>> > > > Hi,  
>> > > >  
>> > > >  
>> > > > Here are the questions I think we should consider:  
>> > > > > 1. Do we need this at all given that we have the partition  
>>argument in  
>> > > > > ProducerRecord which gives full control? I think we do need it  
>>because  
>> > > this  
>> > > > > is a way to plug in a different partitioning strategy at run  
>>time and  
>> > > do it  
>> > > > > in a fairly transparent way.  
>> > > > >  
>> > > >  
>> > > > Yes, we need it if we want to support different partitioning  
>>strategies  
>> > > > inside Kafka rather than requiring the user to code them  
>>externally.  
>> > > >  
>> > > >  
>> > > > > 3. Do we need to add the value? I suspect people will have uses  
>>for  
>> > > > > computing something off a few fields in the value to choose the  
>> > > partition.  
>> > > > > This would be useful in cases where the key was being used for  
>>log  
>> > > > > compaction purposes and did not contain the full information for  
>> > > computing  
>> > > > > the partition.  
>> > > > >  
>> > > >  
>> > > > I am not entirely sure about this. I guess that most partitioners  
>>should  
>> > > > not use it.  
>> > > > I think it makes it easier to reason about the system if the  
>>partitioner  
>> > > > only works on the key.  
>> > > > Hoever, if the value (and its serialization) are already  
>>available, there  
>> > > > is not much harm in passing them along.  
>> > > >  
>> > > >  
>> > > > > 4. This interface doesn't include either an init() or close()  
>>method.  
>> > > It  
>> > > > > should implement Closable and Configurable, right?  
>> > > > >  
>> > > >  
>> > > > Right now the only application I can think of to have an init()  
>>and  
>> > > > close()  
>> > > > is to read some state information (e.g., load information) that is  
>> > > > published on some external distributed storage (e.g., zookeeper)  
>>by the  
>> > > > brokers.  
>> > > > It might be useful also for reconfiguration and state migration.  
>> > > >  
>> > > > I thi
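
To make the configure()-based question in the quoted thread concrete, here is a hedged sketch of how a pluggable partitioner could read its own settings from the producer config map. The interface shape and the config key are assumptions drawn from the KIP discussion, not the final API.

{code:java}
import java.util.Arrays;
import java.util.Map;

interface ConfigurableSketch {
    void configure(Map<String, ?> configs);
}

interface PartitionerSketch extends ConfigurableSketch {
    // key/value are the already-serialized bytes discussed in the thread.
    int partition(String topic, byte[] keyBytes, byte[] valueBytes, int numPartitions);
}

// Example user partitioner that pulls its own setting out of the producer
// config map handed to configure(); "my.partitioner.offset" is made up.
class ModuloPartitioner implements PartitionerSketch {
    private int offset = 0;

    @Override
    public void configure(Map<String, ?> configs) {
        Object value = configs.get("my.partitioner.offset");
        if (value != null) {
            offset = Integer.parseInt(value.toString());
        }
    }

    @Override
    public int partition(String topic, byte[] keyBytes, byte[] valueBytes, int numPartitions) {
        int hash = keyBytes == null ? 0 : Arrays.hashCode(keyBytes);
        return Math.floorMod(hash + offset, numPartitions);
    }
}
{code}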

[jira] [Updated] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Evan Huus (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evan Huus updated KAFKA-2147:
-
Attachment: watch-lists.log

watch list purging logs

> Unbalanced replication can cause extreme purgatory growth
> -
>
> Key: KAFKA-2147
> URL: https://issues.apache.org/jira/browse/KAFKA-2147
> Project: Kafka
>  Issue Type: Bug
>  Components: purgatory, replication
>Affects Versions: 0.8.2.1
>Reporter: Evan Huus
>Assignee: Joel Koshy
> Attachments: watch-lists.log
>
>
> Apologies in advance, this is going to be a bit of complex description, 
> mainly because we've seen this issue several different ways and we're still 
> tying them together in terms of root cause and analysis.
> It is worth noting now that we have all our producers set up to send 
> RequiredAcks==-1, and that this includes all our MirrorMakers.
> I understand the purgatory is being rewritten (again) for 0.8.3. Hopefully 
> that will incidentally fix this issue, or at least render it moot.
> h4. Symptoms
> Fetch request purgatory on a broker or brokers grows rapidly and steadily at 
> a rate of roughly 1-5K requests per second. Heap memory used also grows to 
> keep pace. When 4-5 million requests have accumulated in purgatory, the 
> purgatory is drained, causing a substantial latency spike. The node will tend 
> to drop leadership, replicate, and recover.
> h5. Case 1 - MirrorMaker
> We first noticed this case when enabling mirrormaker. We had one primary 
> cluster already, with many producers and consumers. We created a second, 
> identical cluster and enabled replication from the original to the new 
> cluster on some topics using mirrormaker. This caused all six nodes in the 
> new cluster to exhibit the symptom in lockstep - their purgatories would all 
> grow together, and get drained within about 20 seconds of each other. The 
> cluster-wide latency spikes at this time caused several problems for us.
> Turning MM on and off turned the problem on and off very precisely. When we 
> stopped MM, the purgatories would all drop to normal levels immediately, and 
> would start climbing again when we restarted it.
> Note that this is the *fetch* purgatories on the brokers that MM was 
> *producing* to, which indicates fairly strongly that this is a replication 
> issue, not a MM issue.
> This particular cluster and MM setup was abandoned for other reasons before 
> we could make much progress debugging.
> h5. Case 2 - Broker 6
> The second time we saw this issue was on the newest broker (broker 6) in the 
> original cluster. For a long time we were running with five nodes, and 
> eventually added a sixth to handle the increased load. At first, we moved 
> only a handful of higher-volume partitions to this broker. Later, we created 
> a group of new topics (totalling around 100 partitions) for testing purposes 
> that were spread automatically across all six nodes. These topics saw 
> occasional traffic, but were generally unused. At this point broker 6 had 
> leadership for about an equal number of high-volume and unused partitions, 
> about 15-20 of each.
> Around this time (we don't have detailed enough data to prove real 
> correlation unfortunately), the issue started appearing on this broker as 
> well, but not on any of the other brokers in the cluster.
> h4. Debugging
> The first thing we tried was to reduce the 
> `fetch.purgatory.purge.interval.requests` from the default of 1000 to a much 
> lower value of 200. This had no noticeable effect at all.
> We then enabled debug logging on broker06 and started looking through that. I 
> can attach complete log samples if necessary, but the thing that stood out 
> for us was a substantial number of the following lines:
> {noformat}
> [2015-04-23 20:05:15,196] DEBUG [KafkaApi-6] Putting fetch request with 
> correlation id 49939 from client ReplicaFetcherThread-0-6 into purgatory 
> (kafka.server.KafkaApis)
> {noformat}
> The volume of these lines seemed to match (approximately) the fetch purgatory 
> growth on that broker.
> At this point we developed a hypothesis (detailed below) which guided our 
> subsequent debugging tests:
> - Setting a daemon up to produce regular random data to all of the topics led 
> by kafka06 (specifically the ones which otherwise would receive no data) 
> substantially alleviated the problem.
> - Doing an additional rebalance of the cluster in order to move a number of 
> other topics with regular data to kafka06 appears to have solved the problem 
> completely.
> h4. Hypothesis
> Current versions (0.8.2.1 and earlier) have issues with the replica fetcher 
> not backing off correctly (KAFKA-1461, KAFKA-2082 and others). I believe that 
> in a very specific situation, the replica fetcher thread of one broker can 
> spam a

Re: [DISCUSS] KIP-19 Add a request timeout to NetworkClient

2015-05-12 Thread Mayuresh Gharat
+1 Becket. That would give enough time for clients to move. We should make
this change very clear.

Thanks,

Mayuresh

On Tue, May 12, 2015 at 1:45 PM, Jiangjie Qin 
wrote:

> Hey Ewen,
>
> Very good summary about the compatibility. What you proposed makes sense.
> So basically we can do the following:
>
> In next release, i.e. 0.8.3:
> 1. Add REPLICATION_TIMEOUT_CONFIG (“replication.timeout.ms”)
> 2. Mark TIMEOUT_CONFIG as deprecated
> 3. Override REPLICATION_TIMEOUT_CONFIG with TIMEOUT_CONFIG if it is
> defined and give a warning about deprecation.
> In the release after 0.8.3, we remove TIMEOUT_CONFIG.
>
> This should give enough buffer for this change.
>
> The request timeout is a completely new thing we are adding to fix a bug; I’m with
> you that it does not make sense to have it maintain the old buggy behavior. So we
> can set it to a reasonable value instead of infinite.
>
> Jiangjie (Becket) Qin
>
> On 5/12/15, 12:03 PM, "Ewen Cheslack-Postava"  wrote:
>
> >I think my confusion is coming from this:
> >
> >> So in this KIP, we only address (3). The only public interface change
> >>is a
> >> new configuration of request timeout (and maybe change the configuration
> >> name of TIMEOUT_CONFIG to REPLICATION_TIMEOUT_CONFIG).
> >
> >There are 3 possible compatibility issues I see here:
> >
> >* I assumed this meant the constants also change, so "timeout.ms" becomes
> >"
> >replication.timeout.ms". This breaks config files that worked on the
> >previous version and the only warning would be in release notes. We do
> >warn
> >about unused configs so they might notice the problem.
> >
> >* Binary and source compatibility if someone configures their client in
> >code and uses the TIMEOUT_CONFIG variable. Renaming it will cause existing
> >jars to break if you try to run against an updated client (which seems not
> >very significant since I doubt people upgrade these without recompiling
> >but
> >maybe I'm wrong about that). And it breaks builds without have deprecated
> >that field first, which again, is probably not the biggest issue but is
> >annoying for users and when we accidentally changed the API we received a
> >complaint about breaking builds.
> >
> >* Behavior compatibility as Jay mentioned on the call -- setting the
> >config
> >(even if the name changed) doesn't have the same effect it used to.
> >
> >One solution, which admittedly is more painful to implement and maintain,
> >would be to maintain the timeout.ms config, have it override the others
> if
> >it is specified (including an infinite request timeout I guess?), and if
> >it
> >isn't specified, we can just use the new config variables. Given a real
> >deprecation schedule, users would have better warning of changes and a
> >window to make the changes.
> >
> >I actually think it might not be necessary to maintain the old behavior
> >precisely, although maybe for some code it is an issue if they start
> >seeing
> >timeout exceptions that they wouldn't have seen before?
> >
> >-Ewen
> >
> >On Wed, May 6, 2015 at 6:06 PM, Jun Rao  wrote:
> >
> >> Jiangjie,
> >>
> >> Yes, I think using metadata timeout to expire batches in the record
> >> accumulator makes sense.
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Mon, May 4, 2015 at 10:32 AM, Jiangjie Qin
> >>
> >> wrote:
> >>
> >> > I incorporated Ewen and Guozhang’s comments in the KIP page. Want to
> >> speed
> >> > up on this KIP because currently the mirror maker very likely hangs
> >> > when a broker is down.
> >> >
> >> > I also took a shot to solve KAFKA-1788 in KAFKA-2142. I used metadata
> >> > timeout to expire the batches which are sitting in accumulator without
> >> > leader info. I did that because the situation there is essentially
> >> missing
> >> > metadata.
> >> >
> >> > As a summary of what I am thinking about the timeout in new Producer:
> >> >
> >> > 1. Metadata timeout:
> >> >   - used in send(), blocking
> >> >   - used in accumulator to expire batches with timeout exception.
> >> > 2. Linger.ms
> >> >   - Used in accumulator to ready the batch for drain
> >> > 3. Request timeout
> >> >   - Used in NetworkClient to expire a batch and retry if no response
> >>is
> >> > received for a request before timeout.
> >> >
> >> > So in this KIP, we only address (3). The only public interface change
> >>is
> >> a
> >> > new configuration of request timeout (and maybe change the
> >>configuration
> >> > name of TIMEOUT_CONFIG to REPLICATION_TIMEOUT_CONFIG).
> >> >
> >> > Would like to see what people think of above approach?
> >> >
> >> > Jiangjie (Becket) Qin
> >> >
> >> > On 4/20/15, 6:02 PM, "Jiangjie Qin"  wrote:
> >> >
> >> > >Jun,
> >> > >
> >> > >I thought a little bit differently on this.
> >> > >Intuitively, I am thinking that if a partition is offline, the
> >>metadata
> >> > >for that partition should be considered not ready because we don’t
> >>know
> >> > >which broker we should send the message to. So those sends need to be
> >> > >blocked on metadata timeout.
> >> > >Another thing
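
A small sketch of the deprecation plan outlined at the top of the message above: keep honoring timeout.ms for one more release, warn when it is used, and otherwise fall back to replication.timeout.ms. The helper is illustrative only, not actual client code.

{code:java}
import java.util.Map;

final class TimeoutConfigMigration {
    static final String TIMEOUT_CONFIG = "timeout.ms";                          // deprecated
    static final String REPLICATION_TIMEOUT_CONFIG = "replication.timeout.ms";  // replacement

    static long resolveReplicationTimeoutMs(Map<String, Object> configs, long defaultMs) {
        Object legacy = configs.get(TIMEOUT_CONFIG);
        if (legacy != null) {
            // The old key still wins for now, but warn so users migrate before removal.
            System.err.println("Config '" + TIMEOUT_CONFIG + "' is deprecated; use '"
                    + REPLICATION_TIMEOUT_CONFIG + "' instead.");
            return Long.parseLong(legacy.toString());
        }
        Object value = configs.get(REPLICATION_TIMEOUT_CONFIG);
        return value != null ? Long.parseLong(value.toString()) : defaultMs;
    }
}
{code}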

[jira] [Commented] (KAFKA-2136) Client side protocol changes to return quota delays

2015-05-12 Thread Aditya A Auradkar (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540818#comment-14540818
 ] 

Aditya A Auradkar commented on KAFKA-2136:
--

Updated reviewboard https://reviews.apache.org/r/33378/diff/
 against branch origin/trunk

> Client side protocol changes to return quota delays
> ---
>
> Key: KAFKA-2136
> URL: https://issues.apache.org/jira/browse/KAFKA-2136
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Aditya Auradkar
>Assignee: Aditya Auradkar
> Attachments: KAFKA-2136.patch, KAFKA-2136_2015-05-06_18:32:48.patch, 
> KAFKA-2136_2015-05-06_18:35:54.patch, KAFKA-2136_2015-05-11_14:50:56.patch, 
> KAFKA-2136_2015-05-12_14:40:44.patch
>
>
> As described in KIP-13, evolve the protocol to return a throttle_time_ms in 
> the Fetch and the ProduceResponse objects. Add client side metrics on the new 
> producer and consumer to expose the delay time.
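
For illustration, the client-side half of this could look roughly like the sketch below: a response that carries throttle_time_ms plus a simple avg/max recorder. The field and metric shapes are assumptions, not the actual protocol schema or the Kafka metrics classes.

{code:java}
// A response object that exposes the server-reported throttle time.
class ThrottledResponseSketch {
    final int throttleTimeMs;

    ThrottledResponseSketch(int throttleTimeMs) {
        this.throttleTimeMs = throttleTimeMs;
    }
}

// Tracks the average and maximum observed throttle time on the client.
class ThrottleTimeMetrics {
    private double avg = 0.0;
    private long count = 0;
    private int max = 0;

    synchronized void record(int throttleTimeMs) {
        count++;
        avg += (throttleTimeMs - avg) / count; // incremental running average
        max = Math.max(max, throttleTimeMs);
    }

    synchronized double avgMs() { return avg; }
    synchronized int maxMs() { return max; }
}
{code}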



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 33378: Patch for KAFKA-2136

2015-05-12 Thread Aditya Auradkar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33378/
---

(Updated May 12, 2015, 9:42 p.m.)


Review request for kafka, Joel Koshy and Jun Rao.


Bugs: KAFKA-2136
https://issues.apache.org/jira/browse/KAFKA-2136


Repository: kafka


Description (updated)
---

Changes are
- protocol changes to the fetch request and response to return the 
throttle_time_ms to clients
- New producer/consumer metrics to expose the avg and max delay time for a 
client
- Test cases.

For now the patch will publish a zero delay and return a response


Diffs
-

  
clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java 
ef9dd5238fbc771496029866ece1d85db6d7b7a5 
  clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java 
b2db91ca14bbd17fef5ce85839679144fff3f689 
  clients/src/main/java/org/apache/kafka/common/protocol/Protocol.java 
3dc8b015afd2347a41c9a9dbc02b8e367da5f75f 
  clients/src/main/java/org/apache/kafka/common/requests/FetchRequest.java 
8686d83aa52e435c6adafbe9ff4bd1602281072a 
  clients/src/main/java/org/apache/kafka/common/requests/FetchResponse.java 
eb8951fba48c335095cc43fc3672de1c733e07ff 
  clients/src/main/java/org/apache/kafka/common/requests/ProduceRequest.java 
fabeae3083a8ea55cdacbb9568f3847ccd85bab4 
  clients/src/main/java/org/apache/kafka/common/requests/ProduceResponse.java 
37ec0b79beafcf5735c386b066eb319fb697eff5 
  
clients/src/test/java/org/apache/kafka/clients/consumer/internals/FetcherTest.java
 419541011d652becf0cda7a5e62ce813cddb1732 
  
clients/src/test/java/org/apache/kafka/clients/producer/internals/SenderTest.java
 8b1805d3d2bcb9fe2bacb37d870c3236aa9532c4 
  
clients/src/test/java/org/apache/kafka/common/requests/RequestResponseTest.java 
e3cc1967e407b64cc734548c19e30de700b64ba8 
  core/src/main/scala/kafka/api/FetchRequest.scala 
b038c15186c0cbcc65b59479324052498361b717 
  core/src/main/scala/kafka/api/FetchResponse.scala 
75aaf57fb76ec01660d93701a57ae953d877d81c 
  core/src/main/scala/kafka/api/ProducerRequest.scala 
570b2da1d865086f9830aa919a49063abbbe574d 
  core/src/main/scala/kafka/api/ProducerResponse.scala 
5d1fac4cb8943f5bfaa487f8e9d9d2856efbd330 
  core/src/main/scala/kafka/consumer/SimpleConsumer.scala 
31a2639477bf66f9a05d2b9b07794572d7ec393b 
  core/src/main/scala/kafka/server/AbstractFetcherThread.scala 
83fc47417dd7fe4edf030217fa7fd69d99b170b0 
  core/src/main/scala/kafka/server/DelayedFetch.scala 
de6cf5bdaa0e70394162febc63b50b55ca0a92db 
  core/src/main/scala/kafka/server/DelayedProduce.scala 
05078b24ef28f2f4e099afa943e43f1d00359fda 
  core/src/main/scala/kafka/server/KafkaApis.scala 
417960dd1ab407ebebad8fdb0e97415db3e91a2f 
  core/src/main/scala/kafka/server/OffsetManager.scala 
18680ce100f10035175cc0263ba7787ab0f6a17a 
  core/src/main/scala/kafka/server/ReplicaFetcherThread.scala 
b31b432a226ba79546dd22ef1d2acbb439c2e9a3 
  core/src/main/scala/kafka/server/ReplicaManager.scala 
59c9bc3ac3a8afc07a6f8c88c5871304db588d17 
  core/src/test/scala/unit/kafka/api/RequestResponseSerializationTest.scala 
5717165f2344823fabe8f7cfafae4bb8af2d949a 
  core/src/test/scala/unit/kafka/server/DelayedOperationTest.scala 
f3ab3f4ff8eb1aa6b2ab87ba75f72eceb6649620 
  core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala 
00d59337a99ac135e8689bd1ecd928f7b1423d79 

Diff: https://reviews.apache.org/r/33378/diff/


Testing
---

New tests added


Thanks,

Aditya Auradkar



Re: Review Request 33378: Patch for KAFKA-2136

2015-05-12 Thread Aditya Auradkar


> On May 12, 2015, 7:39 p.m., Dong Lin wrote:
> > core/src/main/scala/kafka/server/ReplicaManager.scala, line 428
> > 
> >
> > Will readFromLocalLog() read data into memory before the client's quota 
> > is checked?

It shouldn't. I believe only a FileMessageSet is passed around, which only 
contains start and end pointers of file segments. The actual read is done later.


- Aditya
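
A minimal illustration of the "pointers only, read later" idea referenced in the answer above. The real class is kafka.log.FileMessageSet (Scala); the names below are made up.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// A message-set "view" that records a file channel plus a start position and
// size, and defers the actual disk read until the slice is consumed. This
// mirrors how a read path can hand back a cheap slice without pulling bytes
// into memory before a quota check.
class LazyFileSlice {
    private final FileChannel channel;
    private final long start;
    private final int size;

    LazyFileSlice(FileChannel channel, long start, int size) {
        this.channel = channel;
        this.start = start;
        this.size = size;
    }

    int sizeInBytes() {
        return size; // answering size questions needs no I/O
    }

    ByteBuffer read() throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(size);
        channel.read(buffer, start); // the actual read happens here, on demand
        buffer.flip();
        return buffer;
    }
}
{code}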


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33378/#review83439
---


On May 12, 2015, 9:40 p.m., Aditya Auradkar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33378/
> ---
> 
> (Updated May 12, 2015, 9:40 p.m.)
> 
> 
> Review request for kafka, Joel Koshy and Jun Rao.
> 
> 
> Bugs: KAFKA-2136
> https://issues.apache.org/jira/browse/KAFKA-2136
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> Minor bug fix
> 
> 
> Diffs
> -
> 
>   
> clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java
>  ef9dd5238fbc771496029866ece1d85db6d7b7a5 
>   
> clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java 
> b2db91ca14bbd17fef5ce85839679144fff3f689 
>   clients/src/main/java/org/apache/kafka/common/protocol/Protocol.java 
> 3dc8b015afd2347a41c9a9dbc02b8e367da5f75f 
>   clients/src/main/java/org/apache/kafka/common/requests/FetchRequest.java 
> 8686d83aa52e435c6adafbe9ff4bd1602281072a 
>   clients/src/main/java/org/apache/kafka/common/requests/FetchResponse.java 
> eb8951fba48c335095cc43fc3672de1c733e07ff 
>   clients/src/main/java/org/apache/kafka/common/requests/ProduceRequest.java 
> fabeae3083a8ea55cdacbb9568f3847ccd85bab4 
>   clients/src/main/java/org/apache/kafka/common/requests/ProduceResponse.java 
> 37ec0b79beafcf5735c386b066eb319fb697eff5 
>   
> clients/src/test/java/org/apache/kafka/clients/consumer/internals/FetcherTest.java
>  419541011d652becf0cda7a5e62ce813cddb1732 
>   
> clients/src/test/java/org/apache/kafka/clients/producer/internals/SenderTest.java
>  8b1805d3d2bcb9fe2bacb37d870c3236aa9532c4 
>   
> clients/src/test/java/org/apache/kafka/common/requests/RequestResponseTest.java
>  e3cc1967e407b64cc734548c19e30de700b64ba8 
>   core/src/main/scala/kafka/api/FetchRequest.scala 
> b038c15186c0cbcc65b59479324052498361b717 
>   core/src/main/scala/kafka/api/FetchResponse.scala 
> 75aaf57fb76ec01660d93701a57ae953d877d81c 
>   core/src/main/scala/kafka/api/ProducerRequest.scala 
> 570b2da1d865086f9830aa919a49063abbbe574d 
>   core/src/main/scala/kafka/api/ProducerResponse.scala 
> 5d1fac4cb8943f5bfaa487f8e9d9d2856efbd330 
>   core/src/main/scala/kafka/consumer/SimpleConsumer.scala 
> 31a2639477bf66f9a05d2b9b07794572d7ec393b 
>   core/src/main/scala/kafka/server/AbstractFetcherThread.scala 
> 83fc47417dd7fe4edf030217fa7fd69d99b170b0 
>   core/src/main/scala/kafka/server/DelayedFetch.scala 
> de6cf5bdaa0e70394162febc63b50b55ca0a92db 
>   core/src/main/scala/kafka/server/DelayedProduce.scala 
> 05078b24ef28f2f4e099afa943e43f1d00359fda 
>   core/src/main/scala/kafka/server/KafkaApis.scala 
> 417960dd1ab407ebebad8fdb0e97415db3e91a2f 
>   core/src/main/scala/kafka/server/OffsetManager.scala 
> 18680ce100f10035175cc0263ba7787ab0f6a17a 
>   core/src/main/scala/kafka/server/ReplicaFetcherThread.scala 
> b31b432a226ba79546dd22ef1d2acbb439c2e9a3 
>   core/src/main/scala/kafka/server/ReplicaManager.scala 
> 59c9bc3ac3a8afc07a6f8c88c5871304db588d17 
>   core/src/test/scala/unit/kafka/api/RequestResponseSerializationTest.scala 
> 5717165f2344823fabe8f7cfafae4bb8af2d949a 
>   core/src/test/scala/unit/kafka/server/DelayedOperationTest.scala 
> f3ab3f4ff8eb1aa6b2ab87ba75f72eceb6649620 
>   core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala 
> 00d59337a99ac135e8689bd1ecd928f7b1423d79 
> 
> Diff: https://reviews.apache.org/r/33378/diff/
> 
> 
> Testing
> ---
> 
> New tests added
> 
> 
> Thanks,
> 
> Aditya Auradkar
> 
>



[jira] [Updated] (KAFKA-2136) Client side protocol changes to return quota delays

2015-05-12 Thread Aditya A Auradkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya A Auradkar updated KAFKA-2136:
-
Attachment: KAFKA-2136_2015-05-12_14:40:44.patch

> Client side protocol changes to return quota delays
> ---
>
> Key: KAFKA-2136
> URL: https://issues.apache.org/jira/browse/KAFKA-2136
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Aditya Auradkar
>Assignee: Aditya Auradkar
> Attachments: KAFKA-2136.patch, KAFKA-2136_2015-05-06_18:32:48.patch, 
> KAFKA-2136_2015-05-06_18:35:54.patch, KAFKA-2136_2015-05-11_14:50:56.patch, 
> KAFKA-2136_2015-05-12_14:40:44.patch
>
>
> As described in KIP-13, evolve the protocol to return a throttle_time_ms in 
> the Fetch and the ProduceResponse objects. Add client side metrics on the new 
> producer and consumer to expose the delay time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 33378: Patch for KAFKA-2136

2015-05-12 Thread Aditya Auradkar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33378/
---

(Updated May 12, 2015, 9:40 p.m.)


Review request for kafka, Joel Koshy and Jun Rao.


Bugs: KAFKA-2136
https://issues.apache.org/jira/browse/KAFKA-2136


Repository: kafka


Description (updated)
---

Minor bug fix


Diffs (updated)
-

  
clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java 
ef9dd5238fbc771496029866ece1d85db6d7b7a5 
  clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java 
b2db91ca14bbd17fef5ce85839679144fff3f689 
  clients/src/main/java/org/apache/kafka/common/protocol/Protocol.java 
3dc8b015afd2347a41c9a9dbc02b8e367da5f75f 
  clients/src/main/java/org/apache/kafka/common/requests/FetchRequest.java 
8686d83aa52e435c6adafbe9ff4bd1602281072a 
  clients/src/main/java/org/apache/kafka/common/requests/FetchResponse.java 
eb8951fba48c335095cc43fc3672de1c733e07ff 
  clients/src/main/java/org/apache/kafka/common/requests/ProduceRequest.java 
fabeae3083a8ea55cdacbb9568f3847ccd85bab4 
  clients/src/main/java/org/apache/kafka/common/requests/ProduceResponse.java 
37ec0b79beafcf5735c386b066eb319fb697eff5 
  
clients/src/test/java/org/apache/kafka/clients/consumer/internals/FetcherTest.java
 419541011d652becf0cda7a5e62ce813cddb1732 
  
clients/src/test/java/org/apache/kafka/clients/producer/internals/SenderTest.java
 8b1805d3d2bcb9fe2bacb37d870c3236aa9532c4 
  
clients/src/test/java/org/apache/kafka/common/requests/RequestResponseTest.java 
e3cc1967e407b64cc734548c19e30de700b64ba8 
  core/src/main/scala/kafka/api/FetchRequest.scala 
b038c15186c0cbcc65b59479324052498361b717 
  core/src/main/scala/kafka/api/FetchResponse.scala 
75aaf57fb76ec01660d93701a57ae953d877d81c 
  core/src/main/scala/kafka/api/ProducerRequest.scala 
570b2da1d865086f9830aa919a49063abbbe574d 
  core/src/main/scala/kafka/api/ProducerResponse.scala 
5d1fac4cb8943f5bfaa487f8e9d9d2856efbd330 
  core/src/main/scala/kafka/consumer/SimpleConsumer.scala 
31a2639477bf66f9a05d2b9b07794572d7ec393b 
  core/src/main/scala/kafka/server/AbstractFetcherThread.scala 
83fc47417dd7fe4edf030217fa7fd69d99b170b0 
  core/src/main/scala/kafka/server/DelayedFetch.scala 
de6cf5bdaa0e70394162febc63b50b55ca0a92db 
  core/src/main/scala/kafka/server/DelayedProduce.scala 
05078b24ef28f2f4e099afa943e43f1d00359fda 
  core/src/main/scala/kafka/server/KafkaApis.scala 
417960dd1ab407ebebad8fdb0e97415db3e91a2f 
  core/src/main/scala/kafka/server/OffsetManager.scala 
18680ce100f10035175cc0263ba7787ab0f6a17a 
  core/src/main/scala/kafka/server/ReplicaFetcherThread.scala 
b31b432a226ba79546dd22ef1d2acbb439c2e9a3 
  core/src/main/scala/kafka/server/ReplicaManager.scala 
59c9bc3ac3a8afc07a6f8c88c5871304db588d17 
  core/src/test/scala/unit/kafka/api/RequestResponseSerializationTest.scala 
5717165f2344823fabe8f7cfafae4bb8af2d949a 
  core/src/test/scala/unit/kafka/server/DelayedOperationTest.scala 
f3ab3f4ff8eb1aa6b2ab87ba75f72eceb6649620 
  core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala 
00d59337a99ac135e8689bd1ecd928f7b1423d79 

Diff: https://reviews.apache.org/r/33378/diff/


Testing
---

New tests added


Thanks,

Aditya Auradkar



[jira] [Updated] (KAFKA-1660) Ability to call close() with a timeout on the Java Kafka Producer.

2015-05-12 Thread Jiangjie Qin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiangjie Qin updated KAFKA-1660:

Attachment: KAFKA-1660_2015-05-12_14:28:53.patch

> Ability to call close() with a timeout on the Java Kafka Producer. 
> ---
>
> Key: KAFKA-1660
> URL: https://issues.apache.org/jira/browse/KAFKA-1660
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, producer 
>Affects Versions: 0.8.2.0
>Reporter: Andrew Stein
>Assignee: Jiangjie Qin
> Fix For: 0.8.3
>
> Attachments: KAFKA-1660.patch, KAFKA-1660.patch, 
> KAFKA-1660_2015-02-17_16:41:19.patch, KAFKA-1660_2015-03-02_10:41:49.patch, 
> KAFKA-1660_2015-03-08_21:14:50.patch, KAFKA-1660_2015-03-09_12:56:39.patch, 
> KAFKA-1660_2015-03-25_10:55:42.patch, KAFKA-1660_2015-03-27_16:35:42.patch, 
> KAFKA-1660_2015-04-07_18:18:40.patch, KAFKA-1660_2015-04-08_14:01:12.patch, 
> KAFKA-1660_2015-04-10_15:08:54.patch, KAFKA-1660_2015-04-16_11:35:37.patch, 
> KAFKA-1660_2015-04-20_17:38:22.patch, KAFKA-1660_2015-04-29_16:58:43.patch, 
> KAFKA-1660_2015-04-29_17:37:51.patch, KAFKA-1660_2015-05-12_14:28:53.patch
>
>
> I would like the ability to call {{close}} with a timeout on the Java 
> Client's KafkaProducer.
> h6. Workaround
> Currently, it is possible to ensure that {{close}} will return quickly by 
> first doing a {{future.get(timeout)}} on the last future produced on each 
> partition, but this means that the user has to define the partitions up front 
> at the time of {{send}} and track the returned {{future}}'s
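For illustration, here is a minimal Java sketch of the workaround described above, under the assumption that the caller tracks the last Future returned by send() for each partition (the lastFuturePerPartition map and the method name are hypothetical; this is not the proposed close(timeout) API):

{noformat}
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.RecordMetadata;

public class BoundedCloseSketch {
    // Hedged sketch of the workaround: bound the wait on the last send per
    // partition, then call the existing no-argument close().
    static void closeWithBoundedWait(KafkaProducer<byte[], byte[]> producer,
                                     Map<Integer, Future<RecordMetadata>> lastFuturePerPartition,
                                     long timeoutMs) throws InterruptedException {
        long deadlineMs = System.currentTimeMillis() + timeoutMs;
        for (Future<RecordMetadata> last : lastFuturePerPartition.values()) {
            long remainingMs = Math.max(0, deadlineMs - System.currentTimeMillis());
            try {
                last.get(remainingMs, TimeUnit.MILLISECONDS); // wait for the last send on this partition
            } catch (TimeoutException e) {
                break; // overall deadline exceeded; stop waiting
            } catch (ExecutionException e) {
                // the send itself failed; nothing left to wait for on this partition
            }
        }
        producer.close(); // should return quickly if the waits above succeeded
    }
}
{noformat}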



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1660) Ability to call close() with a timeout on the Java Kafka Producer.

2015-05-12 Thread Jiangjie Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540793#comment-14540793
 ] 

Jiangjie Qin commented on KAFKA-1660:
-

Updated reviewboard https://reviews.apache.org/r/31850/diff/
 against branch origin/trunk

> Ability to call close() with a timeout on the Java Kafka Producer. 
> ---
>
> Key: KAFKA-1660
> URL: https://issues.apache.org/jira/browse/KAFKA-1660
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, producer 
>Affects Versions: 0.8.2.0
>Reporter: Andrew Stein
>Assignee: Jiangjie Qin
> Fix For: 0.8.3
>
> Attachments: KAFKA-1660.patch, KAFKA-1660.patch, 
> KAFKA-1660_2015-02-17_16:41:19.patch, KAFKA-1660_2015-03-02_10:41:49.patch, 
> KAFKA-1660_2015-03-08_21:14:50.patch, KAFKA-1660_2015-03-09_12:56:39.patch, 
> KAFKA-1660_2015-03-25_10:55:42.patch, KAFKA-1660_2015-03-27_16:35:42.patch, 
> KAFKA-1660_2015-04-07_18:18:40.patch, KAFKA-1660_2015-04-08_14:01:12.patch, 
> KAFKA-1660_2015-04-10_15:08:54.patch, KAFKA-1660_2015-04-16_11:35:37.patch, 
> KAFKA-1660_2015-04-20_17:38:22.patch, KAFKA-1660_2015-04-29_16:58:43.patch, 
> KAFKA-1660_2015-04-29_17:37:51.patch, KAFKA-1660_2015-05-12_14:28:53.patch
>
>
> I would like the ability to call {{close}} with a timeout on the Java 
> Client's KafkaProducer.
> h6. Workaround
> Currently, it is possible to ensure that {{close}} will return quickly by 
> first doing a {{future.get(timeout)}} on the last future produced on each 
> partition, but this means that the user has to define the partitions up front 
> at the time of {{send}} and track the returned {{future}}'s



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 31850: Patch for KAFKA-1660

2015-05-12 Thread Jiangjie Qin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31850/
---

(Updated May 12, 2015, 9:29 p.m.)


Review request for kafka.


Bugs: KAFKA-1660
https://issues.apache.org/jira/browse/KAFKA-1660


Repository: kafka


Description (updated)
---

A minor fix.


Incorporated Guozhang's comments.


Modify according to the latest conclusion.


Patch for the finally passed KIP-15git status


Addressed Joel and Guozhang's comments.


rebased on trunk


Rebase on trunk


Addressed Joel's comments.


Addressed Joel's comments


Addressed Jay's comments


Change java doc as Jay suggested.


Go back to the AtomicInteger approach for less dependency.


Rebased on trunk


Add some missing code from rebase.


Rebased on trunk and incorporated Joel's comments


Rebased on trunk and incorporated Joel's comments


Rebased on trunk and incorporated Joel's comments


Diffs (updated)
-

  clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java 
42b12928781463b56fc4a45d96bb4da2745b6d95 
  clients/src/main/java/org/apache/kafka/clients/producer/MockProducer.java 
6913090af03a455452b0b5c3df78f266126b3854 
  clients/src/main/java/org/apache/kafka/clients/producer/Producer.java 
5b3e75ed83a940bc922f9eca10d4008d67aa37c9 
  
clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java
 49a98838767615dd952da20825f6985698137710 
  clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java 
b2db91ca14bbd17fef5ce85839679144fff3f689 
  clients/src/main/java/org/apache/kafka/common/errors/InterruptException.java 
fee322fa0dd9704374db4a6964246a7d2918d3e4 
  clients/src/main/java/org/apache/kafka/common/serialization/Serializer.java 
16a67a2a5d2a62dd933c53749221e19c5019524b 
  
clients/src/test/java/org/apache/kafka/clients/producer/internals/RecordAccumulatorTest.java
 baa48e7c1b7ac5da8f3aca29f653c3fff88f8009 
  core/src/test/scala/integration/kafka/api/ProducerSendTest.scala 
9811a2b2b1e9bf1beb301138f7626e12d275a8db 

Diff: https://reviews.apache.org/r/31850/diff/


Testing
---

Unit tests passed.


Thanks,

Jiangjie Qin



[jira] [Updated] (KAFKA-2189) Snappy compression of message batches less efficient in 0.8.2.1

2015-05-12 Thread Olson,Andrew (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olson,Andrew updated KAFKA-2189:

Component/s: compression
 build

> Snappy compression of message batches less efficient in 0.8.2.1
> ---
>
> Key: KAFKA-2189
> URL: https://issues.apache.org/jira/browse/KAFKA-2189
> Project: Kafka
>  Issue Type: Bug
>  Components: build, compression, log
>Affects Versions: 0.8.2.1
>Reporter: Olson,Andrew
>Assignee: Jay Kreps
>
> We are using snappy compression and noticed a fairly substantial increase 
> (about 2.25x) in log filesystem space consumption after upgrading a Kafka 
> cluster from 0.8.1.1 to 0.8.2.1. We found that this is caused by messages 
> being seemingly recompressed individually (or possibly with a much smaller 
> buffer or dictionary?) instead of as a batch as sent by producers. We 
> eventually tracked down the change in compression ratio/scope to this [1] 
> commit that updated the snappy version from 1.0.5 to 1.1.1.3. The Kafka 
> client version does not appear to be relevant as we can reproduce this with 
> both the 0.8.1.1 and 0.8.2.1 Producer.
> Here are the log files from our troubleshooting that contain the same set of 
> 1000 messages, for batch sizes of 1, 10, 100, and 1000. f9d9b was the last 
> commit with 0.8.1.1-like behavior prior to f5ab8 introducing the issue.
> {noformat}
> -rw-rw-r-- 1 kafka kafka 404967 May 12 11:45 
> /var/kafka2/f9d9b-batch-1-0/.log
> -rw-rw-r-- 1 kafka kafka 119951 May 12 11:45 
> /var/kafka2/f9d9b-batch-10-0/.log
> -rw-rw-r-- 1 kafka kafka  89645 May 12 11:45 
> /var/kafka2/f9d9b-batch-100-0/.log
> -rw-rw-r-- 1 kafka kafka  88279 May 12 11:45 
> /var/kafka2/f9d9b-batch-1000-0/.log
> -rw-rw-r-- 1 kafka kafka 402837 May 12 11:41 
> /var/kafka2/f5ab8-batch-1-0/.log
> -rw-rw-r-- 1 kafka kafka 382437 May 12 11:41 
> /var/kafka2/f5ab8-batch-10-0/.log
> -rw-rw-r-- 1 kafka kafka 364791 May 12 11:41 
> /var/kafka2/f5ab8-batch-100-0/.log
> -rw-rw-r-- 1 kafka kafka 380693 May 12 11:41 
> /var/kafka2/f5ab8-batch-1000-0/.log
> {noformat}
> [1] 
> https://github.com/apache/kafka/commit/f5ab8e1780cf80f267906e3259ad4f9278c32d28
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2176) DefaultPartitioner doesn't perform consistent hashing based on

2015-05-12 Thread Jiangjie Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540784#comment-14540784
 ] 

Jiangjie Qin commented on KAFKA-2176:
-

[~i_maravic], LinkedIn uses the mirror maker on trunk, so we are already using 
the new producer. I think KIP-22 probably will solve your issue. We are fixing 
some issues that have been identified in the new producer as well as adding 
some things to it. As [~gwenshap] said, I think it is probably better to fix 
forward on the new producer instead of sticking to the old producer.

> DefaultPartitioner doesn't perform consistent hashing based on 
> ---
>
> Key: KAFKA-2176
> URL: https://issues.apache.org/jira/browse/KAFKA-2176
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1
>Reporter: Igor Maravić
>  Labels: easyfix, newbie
> Fix For: 0.8.1
>
> Attachments: KAFKA-2176.patch
>
>
> While deploying MirrorMakers in production, we configured them to use 
> kafka.producer.DefaultPartitioner. Since the topic had the same number of 
> partitions in the local and the aggregation cluster, we expected the messages 
> to be partitioned the same way in both.
> This wasn't the case. Messages were properly partitioned by the 
> DefaultPartitioner on our local cluster, since the key was of type String.
> On the MirrorMaker side, the messages were not properly partitioned.
> The problem is that Array[Byte] does not provide a content-based hashCode, 
> since it is a mutable collection, so identical byte-array keys hash differently.
> The fix is to calculate the deep hash code when the key is of Array type.
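As a standalone Java illustration of the root cause described above (this is not the DefaultPartitioner code itself, and the partition count is made up):

{noformat}
import java.util.Arrays;

public class ByteArrayKeyHashing {
    public static void main(String[] args) {
        byte[] key1 = "user-42".getBytes();
        byte[] key2 = "user-42".getBytes();

        // Arrays inherit Object.hashCode(), which is identity based, so two keys
        // with identical contents hash differently and can map to different partitions.
        System.out.println(key1.hashCode() == key2.hashCode());             // almost certainly false

        // A deep, content-based hash gives the same value for equal contents.
        System.out.println(Arrays.hashCode(key1) == Arrays.hashCode(key2)); // true

        int numPartitions = 8; // hypothetical partition count
        int partition = (Arrays.hashCode(key1) & 0x7fffffff) % numPartitions;
        System.out.println("partition = " + partition);
    }
}
{noformat}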



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request: Corrected the Changes in ZkUtils.scala - KAFKA...

2015-05-12 Thread nssalian
Github user nssalian closed the pull request at:

https://github.com/apache/kafka/pull/63


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-2167) ZkUtils updateEphemeralPath JavaDoc (spelling and correctness)

2015-05-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540782#comment-14540782
 ] 

ASF GitHub Bot commented on KAFKA-2167:
---

Github user nssalian closed the pull request at:

https://github.com/apache/kafka/pull/63


> ZkUtils updateEphemeralPath JavaDoc (spelling and correctness)
> --
>
> Key: KAFKA-2167
> URL: https://issues.apache.org/jira/browse/KAFKA-2167
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jon Bringhurst
>Assignee: Neelesh Srinivas Salian
>  Labels: newbie
>
> I'm not 100% sure on this, but it seems like "persistent" should instead say 
> "ephemeral" in the JavaDoc. Also, note that "parrent" is misspelled.
> {noformat}
>   /**
>* Update the value of a persistent node with the given path and data.
>* create parrent directory if necessary. Never throw NodeExistException.
>*/
>   def updateEphemeralPath(client: ZkClient, path: String, data: String): Unit 
> = {
> try {
>   client.writeData(path, data)
> }
> catch {
>   case e: ZkNoNodeException => {
> createParentPath(client, path)
> client.createEphemeral(path, data)
>   }
>   case e2 => throw e2
> }
>   }
> {noformat}
> should be:
> {noformat}
>   /**
>* Update the value of an ephemeral node with the given path and data.
>* create parent directory if necessary. Never throw NodeExistException.
>*/
>   def updateEphemeralPath(client: ZkClient, path: String, data: String): Unit 
> = {
> try {
>   client.writeData(path, data)
> }
> catch {
>   case e: ZkNoNodeException => {
> createParentPath(client, path)
> client.createEphemeral(path, data)
>   }
>   case e2 => throw e2
> }
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request: Kafka 2167

2015-05-12 Thread nssalian
GitHub user nssalian opened a pull request:

https://github.com/apache/kafka/pull/64

Kafka 2167

Changed ZkUtils.scala for KAFKA-2167

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nssalian/kafka KAFKA-2167

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/64.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #64


commit 904af08e2437d2ddf57017a3f3f1decc2087c491
Author: Neelesh Srinivas Salian 
Date:   2015-05-12T05:17:42Z

KAFKA-2167

commit aa86ba400eab8d1511418ef7fea5f3d06db03b18
Author: Neelesh Srinivas Salian 
Date:   2015-05-12T21:09:01Z

Revert "KAFKA-2167"

This reverts commit 904af08e2437d2ddf57017a3f3f1decc2087c491.

commit 61cbb120e12c60fc952b2e2fdb96db5111e1551a
Author: Neelesh Srinivas Salian 
Date:   2015-05-12T21:20:06Z

Changed ZkUtils.scala




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-2167) ZkUtils updateEphemeralPath JavaDoc (spelling and correctness)

2015-05-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540779#comment-14540779
 ] 

ASF GitHub Bot commented on KAFKA-2167:
---

GitHub user nssalian opened a pull request:

https://github.com/apache/kafka/pull/64

Kafka 2167

Changed ZkUtils.scala for KAFKA-2167

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nssalian/kafka KAFKA-2167

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/64.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #64


commit 904af08e2437d2ddf57017a3f3f1decc2087c491
Author: Neelesh Srinivas Salian 
Date:   2015-05-12T05:17:42Z

KAFKA-2167

commit aa86ba400eab8d1511418ef7fea5f3d06db03b18
Author: Neelesh Srinivas Salian 
Date:   2015-05-12T21:09:01Z

Revert "KAFKA-2167"

This reverts commit 904af08e2437d2ddf57017a3f3f1decc2087c491.

commit 61cbb120e12c60fc952b2e2fdb96db5111e1551a
Author: Neelesh Srinivas Salian 
Date:   2015-05-12T21:20:06Z

Changed ZkUtils.scala




> ZkUtils updateEphemeralPath JavaDoc (spelling and correctness)
> --
>
> Key: KAFKA-2167
> URL: https://issues.apache.org/jira/browse/KAFKA-2167
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jon Bringhurst
>Assignee: Neelesh Srinivas Salian
>  Labels: newbie
>
> I'm not 100% sure on this, but it seems like "persistent" should instead say 
> "ephemeral" in the JavaDoc. Also, note that "parrent" is misspelled.
> {noformat}
>   /**
>* Update the value of a persistent node with the given path and data.
>* create parrent directory if necessary. Never throw NodeExistException.
>*/
>   def updateEphemeralPath(client: ZkClient, path: String, data: String): Unit 
> = {
> try {
>   client.writeData(path, data)
> }
> catch {
>   case e: ZkNoNodeException => {
> createParentPath(client, path)
> client.createEphemeral(path, data)
>   }
>   case e2 => throw e2
> }
>   }
> {noformat}
> should be:
> {noformat}
>   /**
>* Update the value of an ephemeral node with the given path and data.
>* create parent directory if necessary. Never throw NodeExistException.
>*/
>   def updateEphemeralPath(client: ZkClient, path: String, data: String): Unit 
> = {
> try {
>   client.writeData(path, data)
> }
> catch {
>   case e: ZkNoNodeException => {
> createParentPath(client, path)
> client.createEphemeral(path, data)
>   }
>   case e2 => throw e2
> }
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [KIP-DISCUSSION] KIP-22 Expose a Partitioner interface in the new producer

2015-05-12 Thread Jiangjie Qin
Hey Harsha,

It looks like you created the KIP page in the wrong place. Can you move the
page to a child page of
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

Thanks.

Jiangjie (Becket) Qin

On 5/6/15, 6:12 PM, "Harsha"  wrote:

>Thanks for the review Joel. I agree don't need a init method we can use
>configure. I'll update the KIP.
>-Harsha
>
>On Wed, May 6, 2015, at 04:45 PM, Joel Koshy wrote:
>> +1 with a minor comment: do we need an init method given it extends
>> Configurable?
>> 
>> Also, can you move this wiki out of drafts and add it to the table in
>> 
>>https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Propo
>>sals?
>> 
>> Thanks,
>> 
>> Joel
>> 
>> On Wed, May 06, 2015 at 07:46:46AM -0700, Sriharsha Chintalapani wrote:
>> > Thanks Jay. I removed partitioner.metadata from the KIP. I'll send an
>>updated patch.
>> > 
>> > -- 
>> > Harsha
>> > Sent with Airmail
>> > 
>> > On May 5, 2015 at 6:31:47 AM, Sriharsha Chintalapani
>>(harsh...@fastmail.fm) wrote:
>> > 
>> > Thanks for the comments everyone.
>> > Hi Jay,
>> >  I do have a question regarding the Configurable interface and how to
>>pass in a Map of properties. I couldn't find any other classes
>>using it. The JMX reporter overrides it but doesn't implement it. So with a
>>configurable partitioner, how can a user pass in partitioner
>>configuration, since it's getting instantiated within the producer?
>> > 
>> > Thanks,
>> > Harsha
>> > 
>> > 
>> > On May 4, 2015 at 10:36:45 AM, Jay Kreps (jay.kr...@gmail.com) wrote:
>> > 
>> > Hey Harsha,
>> > 
>> > That proposal sounds good. One minor thing--I don't think we need to
>>have
>> > the partitioner.metadata property. Our reason for using string
>>properties
>> > is exactly to make config extensible at runtime. So a given
>>partitioner can
>> > add whatever properties make sense using the configure() api it
>>defines.
>> > 
>> > -Jay
>> > 
>> > On Sun, May 3, 2015 at 5:57 PM, Harsha  wrote:
>> > 
>> > > Thanks Jay & Gianmarco for the comments. I picked option A: if the
>>user
>> > > sends a partition id then it will be applied, and the partitioner.class
>>method
>> > > will only be called if the partition id is null.
>> > > Please take a look at the updated KIP here
>> > >
>> > > 
>>https://cwiki.apache.org/confluence/display/KAFKA/KIP-+22+-+Expose+a+Part
>>itioner+interface+in+the+new+producer
>> > > . Let me know if you see anything missing.
>> > >
>> > > Thanks,
>> > > Harsha
>> > >
>> > > On Fri, Apr 24, 2015, at 02:15 AM, Gianmarco De Francisci Morales
>>wrote:
>> > > > Hi,
>> > > >
>> > > >
>> > > > Here are the questions I think we should consider:
>> > > > > 1. Do we need this at all given that we have the partition
>>argument in
>> > > > > ProducerRecord which gives full control? I think we do need it
>>because
>> > > this
>> > > > > is a way to plug in a different partitioning strategy at run
>>time and
>> > > do it
>> > > > > in a fairly transparent way.
>> > > > >
>> > > >
>> > > > Yes, we need it if we want to support different partitioning
>>strategies
>> > > > inside Kafka rather than requiring the user to code them
>>externally.
>> > > >
>> > > >
>> > > > > 3. Do we need to add the value? I suspect people will have uses
>>for
>> > > > > computing something off a few fields in the value to choose the
>> > > partition.
>> > > > > This would be useful in cases where the key was being used for
>>log
>> > > > > compaction purposes and did not contain the full information for
>> > > computing
>> > > > > the partition.
>> > > > >
>> > > >
>> > > > I am not entirely sure about this. I guess that most partitioners
>>should
>> > > > not use it.
>> > > > I think it makes it easier to reason about the system if the
>>partitioner
>> > > > only works on the key.
>> > > > However, if the value (and its serialization) are already
>>available, there
>> > > > is not much harm in passing them along.
>> > > >
>> > > >
>> > > > > 4. This interface doesn't include either an init() or close()
>>method.
>> > > It
>> > > > > should implement Closable and Configurable, right?
>> > > > >
>> > > >
>> > > > Right now the only application I can think of to have an init()
>>and
>> > > > close()
>> > > > is to read some state information (e.g., load information) that is
>> > > > published on some external distributed storage (e.g., zookeeper)
>>by the
>> > > > brokers.
>> > > > It might be useful also for reconfiguration and state migration.
>> > > >
>> > > > I think it's not a very common use case right now, but if the
>>added
>> > > > complexity is not too much it might be worth to have support for
>>these
>> > > > methods.
>> > > >
>> > > >
>> > > >
>> > > > > 5. What happens if the user both sets the partition id in the
>> > > > > ProducerRecord and sets a partitioner? Does the partition id
>>just get
>> > > > > passed in to the partitioner (as sort of implied in this
>>interface?).
>> > > This
>> > > > > is a bit weird since if you pass in the partition id you kind of
>> > > expect it
>> > > > >

Re: Review Request 33049: Patch for KAFKA-2084

2015-05-12 Thread Aditya Auradkar


> On May 12, 2015, 7:38 p.m., Dong Lin wrote:
> > clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java, line 117
> > 
> >
> > metric.value(timeMs), which translates to Rate.measure(config, timeMs), 
> > may return Infinity. This may introduce infinite delay.
> > 
> > I think this bug is rooted either in class Rate or  class SampledStat. 
> > In short, SampledStat.purgeObsoleteSamples will remove all Samples, such 
> > that SampledStat.oldest(now) == now.
> > 
> > Should I open a ticket and submit a patch for it?

Hey dong, yeah you should submit a patch for it.
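For context, a standalone sketch of the failure mode being described and one possible guard (the names and the guard itself are illustrative; this is not the actual Rate/SampledStat code):

{noformat}
public class RateSketch {
    // Rate.measure() effectively divides the accumulated value by the elapsed
    // time since the oldest sample's window started.
    static double measure(double totalValue, long nowMs, long oldestWindowMs, long minWindowMs) {
        long elapsedMs = nowMs - oldestWindowMs;
        // If all samples were just purged, oldestWindowMs == nowMs, so elapsedMs == 0
        // and the division below would return Infinity.
        long effectiveElapsedMs = Math.max(elapsedMs, minWindowMs); // hypothetical guard
        return totalValue * 1000.0 / effectiveElapsedMs;            // events per second
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(100.0 / (now - now));            // Infinity: the reported problem
        System.out.println(measure(100.0, now, now, 1000)); // finite with the guard
    }
}
{noformat}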


> On May 12, 2015, 7:38 p.m., Dong Lin wrote:
> > clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java, lines 
> > 122-127
> > 
> >
> > The function quits on the first quota violation, and calculates 
> > delayTime based on this violation. 
> > 
> > Should we enumerate all metrics in this.metrics() and set the delayTime of 
> > the QuotaViolationException to be the largest delayTime of all violations?

Also a good point. Perhaps you can fix it in the patch you will submit? For 
quotas, we effectively have only one quota per sensor, so it's ok (for now).
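For illustration, a rough sketch of the enumeration idea with hypothetical types (this is not the Sensor or KafkaMetric API); the point is simply to scan every quota-bearing metric and keep the largest delay instead of stopping at the first violation:

{noformat}
import java.util.List;

public class MaxDelaySketch {
    // Hypothetical holder for one quota check result.
    static class QuotaCheck {
        final double measured;
        final double bound;
        final long delayMs; // delay needed to bring 'measured' back under 'bound'
        QuotaCheck(double measured, double bound, long delayMs) {
            this.measured = measured;
            this.bound = bound;
            this.delayMs = delayMs;
        }
    }

    // Return the largest delay across all violations; 0 means nothing was violated.
    static long maxDelayMs(List<QuotaCheck> checks) {
        long max = 0;
        for (QuotaCheck c : checks) {
            if (c.measured > c.bound) {
                max = Math.max(max, c.delayMs);
            }
        }
        return max;
    }
}
{noformat}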


- Aditya


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33049/#review83441
---


On May 11, 2015, 11:17 p.m., Aditya Auradkar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33049/
> ---
> 
> (Updated May 11, 2015, 11:17 p.m.)
> 
> 
> Review request for kafka, Joel Koshy and Jun Rao.
> 
> 
> Bugs: KAFKA-2084
> https://issues.apache.org/jira/browse/KAFKA-2084
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> Updated patch for quotas. This patch does the following: 
> 1. Add per-client metrics for both producer and consumers 
> 2. Add configuration for quotas 
> 3. Compute delay times in the metrics package and return the delay times in 
> QuotaViolationException 
> 4. Add a DelayQueue in KafkaApi's that can be used to throttle any type of 
> request. Implemented request throttling for produce and fetch requests. 
> 5. Added unit and integration test cases.
> 6. This doesn't include a system test. There is a separate ticket for that
> 
> 
> Diffs
> -
> 
>   clients/src/main/java/org/apache/kafka/common/metrics/MetricConfig.java 
> dfa1b0a11042ad9d127226f0e0cec8b1d42b8441 
>   clients/src/main/java/org/apache/kafka/common/metrics/Quota.java 
> d82bb0c055e631425bc1ebbc7d387baac76aeeaa 
>   
> clients/src/main/java/org/apache/kafka/common/metrics/QuotaViolationException.java
>  a451e5385c9eca76b38b425e8ac856b2715fcffe 
>   clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java 
> ca823fd4639523018311b814fde69b6177e73b97 
>   clients/src/test/java/org/apache/kafka/common/utils/MockTime.java  
>   core/src/main/scala/kafka/server/ClientQuotaMetrics.scala PRE-CREATION 
>   core/src/main/scala/kafka/server/KafkaApis.scala 
> 417960dd1ab407ebebad8fdb0e97415db3e91a2f 
>   core/src/main/scala/kafka/server/KafkaConfig.scala 
> 9efa15ca5567b295ab412ee9eea7c03eb4cdc18b 
>   core/src/main/scala/kafka/server/KafkaServer.scala 
> b7d2a2842e17411a823b93bdedc84657cbd62be1 
>   core/src/main/scala/kafka/server/ReplicaManager.scala 
> 59c9bc3ac3a8afc07a6f8c88c5871304db588d17 
>   core/src/main/scala/kafka/server/ThrottledRequest.scala PRE-CREATION 
>   core/src/main/scala/kafka/utils/ShutdownableThread.scala 
> fc226c863095b7761290292cd8755cd7ad0f155c 
>   core/src/test/scala/integration/kafka/api/QuotasTest.scala PRE-CREATION 
>   core/src/test/scala/unit/kafka/server/ClientQuotaMetricsTest.scala 
> PRE-CREATION 
>   core/src/test/scala/unit/kafka/server/KafkaConfigConfigDefTest.scala 
> 8014a5a6c362785539f24eb03d77278434614fe6 
>   core/src/test/scala/unit/kafka/server/ThrottledRequestExpirationTest.scala 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/33049/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Aditya Auradkar
> 
>



[jira] [Commented] (KAFKA-2186) Follow-up patch of KAFKA-1650

2015-05-12 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540706#comment-14540706
 ] 

Joel Koshy commented on KAFKA-2186:
---

Hmm.. though this is trivial and an omission from KAFKA-1650, this is an API 
change. Probably not KIP-worthy but maybe warrants some discussion on the open 
source list?

> Follow-up patch of KAFKA-1650
> -
>
> Key: KAFKA-2186
> URL: https://issues.apache.org/jira/browse/KAFKA-2186
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
> Attachments: KAFKA-2186.patch
>
>
> Offsets commit with a map was added in KAFKA-1650. It should be added to 
> consumer connector java API also.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 34070: Patch for KAFKA-2186

2015-05-12 Thread Joel Koshy


> On May 12, 2015, 2:31 a.m., Aditya Auradkar wrote:
> > core/src/main/scala/kafka/javaapi/consumer/ConsumerConnector.java, line 73
> > 
> >
> > How does this work if the consumer doesn't own these partitions? Is it 
> > possible to commit offsets for any topic? Just curious..

Yes - it is possible.


- Joel


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34070/#review83343
---


On May 12, 2015, 1:39 a.m., Jiangjie Qin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34070/
> ---
> 
> (Updated May 12, 2015, 1:39 a.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-2186
> https://issues.apache.org/jira/browse/KAFKA-2186
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> Patch for KAFKA-2186 follow-up patch of KAFKA-1650, add the missing offset 
> commit with map in java api
> 
> 
> Diffs
> -
> 
>   core/src/main/scala/kafka/javaapi/consumer/ConsumerConnector.java 
> cc3400ff81fc0db69b5129ad7b440f20a211a79d 
> 
> Diff: https://reviews.apache.org/r/34070/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Jiangjie Qin
> 
>



Re: Review Request 34070: Patch for KAFKA-2186

2015-05-12 Thread Joel Koshy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34070/#review83472
---

Ship it!


Ship It!

- Joel Koshy


On May 12, 2015, 1:39 a.m., Jiangjie Qin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34070/
> ---
> 
> (Updated May 12, 2015, 1:39 a.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-2186
> https://issues.apache.org/jira/browse/KAFKA-2186
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> Patch for KAFKA-2186 follow-up patch of KAFKA-1650, add the missing offset 
> commit with map in java api
> 
> 
> Diffs
> -
> 
>   core/src/main/scala/kafka/javaapi/consumer/ConsumerConnector.java 
> cc3400ff81fc0db69b5129ad7b440f20a211a79d 
> 
> Diff: https://reviews.apache.org/r/34070/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Jiangjie Qin
> 
>



Re: [DISCUSS] KIP-19 Add a request timeout to NetworkClient

2015-05-12 Thread Jiangjie Qin
Hey Ewen, 

Very good summary about the compatibility. What you proposed makes sense.
So basically we can do the following:

In the next release, i.e. 0.8.3:
1. Add REPLICATION_TIMEOUT_CONFIG (“replication.timeout.ms”)
2. Mark TIMEOUT_CONFIG as deprecated
3. Override REPLICATION_TIMEOUT_CONFIG with TIMEOUT_CONFIG if it is
defined and give a warning about deprecation.
In the release after 0.8.3, we remove TIMEOUT_CONFIG.

This should give enough buffer for this change.

Request timeout is a completely new thing we are adding to fix a bug, and I’m
with you that it does not make sense to have it maintain the old buggy
behavior. So we can set it to a reasonable value instead of infinite.
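A hedged sketch of item 3 in the plan above; "timeout.ms" is the existing key, "replication.timeout.ms" is the proposed one, and the bridge method itself is only illustrative:

{noformat}
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TimeoutConfigBridge {
    private static final Logger log = LoggerFactory.getLogger(TimeoutConfigBridge.class);

    static final String TIMEOUT_CONFIG = "timeout.ms";                         // deprecated
    static final String REPLICATION_TIMEOUT_CONFIG = "replication.timeout.ms"; // proposed replacement

    // If the deprecated key is set, it overrides the new key and a warning is logged.
    static void bridge(Map<String, Object> configs) {
        if (configs.containsKey(TIMEOUT_CONFIG)) {
            log.warn("{} is deprecated; please use {} instead", TIMEOUT_CONFIG, REPLICATION_TIMEOUT_CONFIG);
            configs.put(REPLICATION_TIMEOUT_CONFIG, configs.get(TIMEOUT_CONFIG));
        }
    }
}
{noformat}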

Jiangjie (Becket) Qin

On 5/12/15, 12:03 PM, "Ewen Cheslack-Postava"  wrote:

>I think my confusion is coming from this:
>
>> So in this KIP, we only address (3). The only public interface change
>>is a
>> new configuration of request timeout (and maybe change the configuration
>> name of TIMEOUT_CONFIG to REPLICATION_TIMEOUT_CONFIG).
>
>There are 3 possible compatibility issues I see here:
>
>* I assumed this meant the constants also change, so "timeout.ms" becomes
>"
>replication.timeout.ms". This breaks config files that worked on the
>previous version and the only warning would be in release notes. We do
>warn
>about unused configs so they might notice the problem.
>
>* Binary and source compatibility if someone configures their client in
>code and uses the TIMEOUT_CONFIG variable. Renaming it will cause existing
>jars to break if you try to run against an updated client (which seems not
>very significant since I doubt people upgrade these without recompiling
>but
>maybe I'm wrong about that). And it breaks builds without having deprecated
>that field first, which again, is probably not the biggest issue but is
>annoying for users and when we accidentally changed the API we received a
>complaint about breaking builds.
>
>* Behavior compatibility as Jay mentioned on the call -- setting the
>config
>(even if the name changed) doesn't have the same effect it used to.
>
>One solution, which admittedly is more painful to implement and maintain,
>would be to maintain the timeout.ms config, have it override the others if
>it is specified (including an infinite request timeout I guess?), and if
>it
>isn't specified, we can just use the new config variables. Given a real
>deprecation schedule, users would have better warning of changes and a
>window to make the changes.
>
>I actually think it might not be necessary to maintain the old behavior
>precisely, although maybe for some code it is an issue if they start
>seeing
>timeout exceptions that they wouldn't have seen before?
>
>-Ewen
>
>On Wed, May 6, 2015 at 6:06 PM, Jun Rao  wrote:
>
>> Jiangjie,
>>
>> Yes, I think using metadata timeout to expire batches in the record
>> accumulator makes sense.
>>
>> Thanks,
>>
>> Jun
>>
>> On Mon, May 4, 2015 at 10:32 AM, Jiangjie Qin
>>
>> wrote:
>>
>> > I incorporated Ewen and Guozhang’s comments in the KIP page. I want to
>> speed
>> > up on this KIP because currently mirror maker is very likely to hang
>> > when a broker is down.
>> >
>> > I also took a shot to solve KAFKA-1788 in KAFKA-2142. I used metadata
>> > timeout to expire the batches which are sitting in accumulator without
>> > leader info. I did that because the situation there is essentially
>> missing
>> > metadata.
>> >
>> > As a summary of what I am thinking about the timeout in new Producer:
>> >
>> > 1. Metadata timeout:
>> >   - used in send(), blocking
>> >   - used in accumulator to expire batches with timeout exception.
>> > 2. Linger.ms
>> >   - Used in accumulator to ready the batch for drain
>> > 3. Request timeout
>> >   - Used in NetworkClient to expire a batch and retry if no response
>>is
>> > received for a request before timeout.
>> >
>> > So in this KIP, we only address (3). The only public interface change
>>is
>> a
>> > new configuration of request timeout (and maybe change the
>>configuration
>> > name of TIMEOUT_CONFIG to REPLICATION_TIMEOUT_CONFIG).
>> >
>> > Would like to see what people think of above approach?
>> >
>> > Jiangjie (Becket) Qin
>> >
>> > On 4/20/15, 6:02 PM, "Jiangjie Qin"  wrote:
>> >
>> > >Jun,
>> > >
>> > >I thought a little bit differently on this.
>> > >Intuitively, I am thinking that if a partition is offline, the
>>metadata
>> > >for that partition should be considered not ready because we don’t
>>know
>> > >which broker we should send the message to. So those sends need to be
>> > >blocked on metadata timeout.
>> > >Another thing I’m wondering is in which scenario an offline partition
>> will
>> > >become online again in a short period of time and how likely it will
>> > >occur. My understanding is that the batch timeout for batches
>>sitting in
>> > >accumulator should be larger than linger.ms but should not be too
>>long
>> > >(e.g. less than 60 seconds). Otherwise it will exhaust the shared
>>buffer
>> > >with batches to be aborted.
>> > >
>>

[jira] [Comment Edited] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Evan Huus (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540669#comment-14540669
 ] 

Evan Huus edited comment on KAFKA-2147 at 5/12/15 8:43 PM:
---

1. 150-200 based on a quick wireshark capture

2. Thousands, we had full debug logging enabled for this period. I'm not sure 
what might be relevant. At a guess, I am seeing lots of the following three 
lines, across many different topic/partitions:
{noformat}
[2015-04-30 15:54:13,999] DEBUG [Replica Manager on Broker 1]: Request key 
[admin,4] unblocked 0 producer requests. (kafka.server.ReplicaManager)
[2015-04-30 15:54:13,999] DEBUG [Replica Manager on Broker 1]: Recorded 
follower 3 position 332616 for partition [admin,4]. 
(kafka.server.ReplicaManager)
[2015-04-30 15:54:13,999] DEBUG Partition [admin,4] on broker 1: Skipping 
update high watermark since Old hw 332616 [0 : 337335616] is larger than new hw 
332616 [0 : 337335616] for partition [admin,4]. All leo's are 332616 [0 : 
337335616] (kafka.cluster.Partition)
{noformat}

3. No ZK problems I can spot.


was (Author: eapache):
1. 150-200 base on a quick wireshark capture

2. Thousands, we had full debug logging enabled for this period. I'm not sure 
what might be relevant. At a guess, I am seeing lots of the following three 
lines, across many different topic/partitions:
{noformat}
[2015-04-30 15:54:13,999] DEBUG [Replica Manager on Broker 1]: Request key 
[admin,4] unblocked 0 producer requests. (kafka.server.ReplicaManager)
[2015-04-30 15:54:13,999] DEBUG [Replica Manager on Broker 1]: Recorded 
follower 3 position 332616 for partition [admin,4]. 
(kafka.server.ReplicaManager)
[2015-04-30 15:54:13,999] DEBUG Partition [admin,4] on broker 1: Skipping 
update high watermark since Old hw 332616 [0 : 337335616] is larger than new hw 
332616 [0 : 337335616] for partition [admin,4]. All leo's are 332616 [0 : 
337335616] (kafka.cluster.Partition)
{noformat}

3. No ZK problems I can spot.

> Unbalanced replication can cause extreme purgatory growth
> -
>
> Key: KAFKA-2147
> URL: https://issues.apache.org/jira/browse/KAFKA-2147
> Project: Kafka
>  Issue Type: Bug
>  Components: purgatory, replication
>Affects Versions: 0.8.2.1
>Reporter: Evan Huus
>Assignee: Joel Koshy
>
> Apologies in advance, this is going to be a bit of a complex description, 
> mainly because we've seen this issue in several different ways and we're still 
> tying them together in terms of root cause and analysis.
> It is worth noting now that we have all our producers set up to send 
> RequiredAcks==-1, and that this includes all our MirrorMakers.
> I understand the purgatory is being rewritten (again) for 0.8.3. Hopefully 
> that will incidentally fix this issue, or at least render it moot.
> h4. Symptoms
> Fetch request purgatory on a broker or brokers grows rapidly and steadily at 
> a rate of roughly 1-5K requests per second. Heap memory used also grows to 
> keep pace. When 4-5 million requests have accumulated in purgatory, the 
> purgatory is drained, causing a substantial latency spike. The node will tend 
> to drop leadership, replicate, and recover.
> h5. Case 1 - MirrorMaker
> We first noticed this case when enabling mirrormaker. We had one primary 
> cluster already, with many producers and consumers. We created a second, 
> identical cluster and enabled replication from the original to the new 
> cluster on some topics using mirrormaker. This caused all six nodes in the 
> new cluster to exhibit the symptom in lockstep - their purgatories would all 
> grow together, and get drained within about 20 seconds of each other. The 
> cluster-wide latency spikes at this time caused several problems for us.
> Turning MM on and off turned the problem on and off very precisely. When we 
> stopped MM, the purgatories would all drop to normal levels immediately, and 
> would start climbing again when we restarted it.
> Note that this is the *fetch* purgatories on the brokers that MM was 
> *producing* to, which indicates fairly strongly that this is a replication 
> issue, not a MM issue.
> This particular cluster and MM setup was abandoned for other reasons before 
> we could make much progress debugging.
> h5. Case 2 - Broker 6
> The second time we saw this issue was on the newest broker (broker 6) in the 
> original cluster. For a long time we were running with five nodes, and 
> eventually added a sixth to handle the increased load. At first, we moved 
> only a handful of higher-volume partitions to this broker. Later, we created 
> a group of new topics (totalling around 100 partitions) for testing purposes 
> that were spread automatically across all six nodes. These topics saw 
> occasional traffic, but were generally unused. At this point broker 6 h

[jira] [Commented] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Evan Huus (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540669#comment-14540669
 ] 

Evan Huus commented on KAFKA-2147:
--

1. 150-200 base on a quick wireshark capture

2. Thousands, we had full debug logging enabled for this period. I'm not sure 
what might be relevant. At a guess, I am seeing lots of the following three 
lines, across many different topic/partitions:
{noformat}
[2015-04-30 15:54:13,999] DEBUG [Replica Manager on Broker 1]: Request key 
[admin,4] unblocked 0 producer requests. (kafka.server.ReplicaManager)
[2015-04-30 15:54:13,999] DEBUG [Replica Manager on Broker 1]: Recorded 
follower 3 position 332616 for partition [admin,4]. 
(kafka.server.ReplicaManager)
[2015-04-30 15:54:13,999] DEBUG Partition [admin,4] on broker 1: Skipping 
update high watermark since Old hw 332616 [0 : 337335616] is larger than new hw 
332616 [0 : 337335616] for partition [admin,4]. All leo's are 332616 [0 : 
337335616] (kafka.cluster.Partition)
{noformat}

3. No ZK problems I can spot.

> Unbalanced replication can cause extreme purgatory growth
> -
>
> Key: KAFKA-2147
> URL: https://issues.apache.org/jira/browse/KAFKA-2147
> Project: Kafka
>  Issue Type: Bug
>  Components: purgatory, replication
>Affects Versions: 0.8.2.1
>Reporter: Evan Huus
>Assignee: Joel Koshy
>
> Apologies in advance, this is going to be a bit of a complex description, 
> mainly because we've seen this issue in several different ways and we're still 
> tying them together in terms of root cause and analysis.
> It is worth noting now that we have all our producers set up to send 
> RequiredAcks==-1, and that this includes all our MirrorMakers.
> I understand the purgatory is being rewritten (again) for 0.8.3. Hopefully 
> that will incidentally fix this issue, or at least render it moot.
> h4. Symptoms
> Fetch request purgatory on a broker or brokers grows rapidly and steadily at 
> a rate of roughly 1-5K requests per second. Heap memory used also grows to 
> keep pace. When 4-5 million requests have accumulated in purgatory, the 
> purgatory is drained, causing a substantial latency spike. The node will tend 
> to drop leadership, replicate, and recover.
> h5. Case 1 - MirrorMaker
> We first noticed this case when enabling mirrormaker. We had one primary 
> cluster already, with many producers and consumers. We created a second, 
> identical cluster and enabled replication from the original to the new 
> cluster on some topics using mirrormaker. This caused all six nodes in the 
> new cluster to exhibit the symptom in lockstep - their purgatories would all 
> grow together, and get drained within about 20 seconds of each other. The 
> cluster-wide latency spikes at this time caused several problems for us.
> Turning MM on and off turned the problem on and off very precisely. When we 
> stopped MM, the purgatories would all drop to normal levels immediately, and 
> would start climbing again when we restarted it.
> Note that this is the *fetch* purgatories on the brokers that MM was 
> *producing* to, which indicates fairly strongly that this is a replication 
> issue, not a MM issue.
> This particular cluster and MM setup was abandoned for other reasons before 
> we could make much progress debugging.
> h5. Case 2 - Broker 6
> The second time we saw this issue was on the newest broker (broker 6) in the 
> original cluster. For a long time we were running with five nodes, and 
> eventually added a sixth to handle the increased load. At first, we moved 
> only a handful of higher-volume partitions to this broker. Later, we created 
> a group of new topics (totalling around 100 partitions) for testing purposes 
> that were spread automatically across all six nodes. These topics saw 
> occasional traffic, but were generally unused. At this point broker 6 had 
> leadership for about an equal number of high-volume and unused partitions, 
> about 15-20 of each.
> Around this time (we don't have detailed enough data to prove real 
> correlation unfortunately), the issue started appearing on this broker as 
> well, but not on any of the other brokers in the cluster.
> h4. Debugging
> The first thing we tried was to reduce the 
> `fetch.purgatory.purge.interval.requests` from the default of 1000 to a much 
> lower value of 200. This had no noticeable effect at all.
> We then enabled debug logging on broker06 and started looking through that. I 
> can attach complete log samples if necessary, but the thing that stood out 
> for us was a substantial number of the following lines:
> {noformat}
> [2015-04-23 20:05:15,196] DEBUG [KafkaApi-6] Putting fetch request with 
> correlation id 49939 from client ReplicaFetcherThread-0-6 into purgatory 
> (kafka.server.KafkaApis)
> {noformat}
> The volume 

[jira] [Commented] (KAFKA-2150) FetcherThread backoff need to grab lock before wait on condition.

2015-05-12 Thread Sriharsha Chintalapani (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540651#comment-14540651
 ] 

Sriharsha Chintalapani commented on KAFKA-2150:
---

Thanks for merging it [~guozhang], and sorry about sending a bad patch :).

> FetcherThread backoff need to grab lock before wait on condition.
> -
>
> Key: KAFKA-2150
> URL: https://issues.apache.org/jira/browse/KAFKA-2150
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jiangjie Qin
>Assignee: Sriharsha Chintalapani
>  Labels: newbie++
> Attachments: KAFKA-2150.patch, KAFKA-2150_2015-04-25_13:14:05.patch, 
> KAFKA-2150_2015-04-25_13:18:35.patch, KAFKA-2150_2015-04-25_13:35:36.patch
>
>
> Saw the following error: 
> kafka.api.ProducerBounceTest > testBrokerFailure STANDARD_OUT
> [2015-04-25 00:40:43,997] ERROR [ReplicaFetcherThread-0-0], Error due to  
> (kafka.server.ReplicaFetcherThread:103)
> java.lang.IllegalMonitorStateException
>   at 
> java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:127)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1239)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.fullyRelease(AbstractQueuedSynchronizer.java:1668)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2107)
>   at 
> kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:95)
>   at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> [2015-04-25 00:40:47,064] ERROR [ReplicaFetcherThread-0-1], Error due to  
> (kafka.server.ReplicaFetcherThread:103)
> java.lang.IllegalMonitorStateException
>   at 
> java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:127)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1239)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.fullyRelease(AbstractQueuedSynchronizer.java:1668)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2107)
>   at 
> kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:95)
>   at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> We should grab the lock before waiting on the condition.
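A minimal standalone Java illustration of the fix (not the AbstractFetcherThread code): Condition.await() must be called while holding the associated lock, otherwise it throws IllegalMonitorStateException as in the traces above.

{noformat}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class BackoffSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition condition = lock.newCondition();

    // Grab the lock first, then wait on the condition (with a bounded backoff).
    void backoff(long backoffMs) throws InterruptedException {
        lock.lock();
        try {
            condition.await(backoffMs, TimeUnit.MILLISECONDS);
        } finally {
            lock.unlock();
        }
    }
}
{noformat}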



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 31850: Patch for KAFKA-1660

2015-05-12 Thread Joel Koshy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31850/#review83433
---

Ship it!


Minor comments - I will address these on check-in.


clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java


before marking it as done.



core/src/test/scala/integration/kafka/api/ProducerSendTest.scala


This tests idempotence of the close call.



core/src/test/scala/integration/kafka/api/ProducerSendTest.scala


`100 -> %d, (i + 1) * numRecords`


- Joel Koshy


On April 30, 2015, 12:37 a.m., Jiangjie Qin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/31850/
> ---
> 
> (Updated April 30, 2015, 12:37 a.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-1660
> https://issues.apache.org/jira/browse/KAFKA-1660
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> A minor fix.
> 
> 
> Incorporated Guozhang's comments.
> 
> 
> Modify according to the latest conclusion.
> 
> 
> Patch for the finally passed KIP-15git status
> 
> 
> Addressed Joel and Guozhang's comments.
> 
> 
> rebased on trunk
> 
> 
> Rebase on trunk
> 
> 
> Addressed Joel's comments.
> 
> 
> Addressed Joel's comments
> 
> 
> Addressed Jay's comments
> 
> 
> Change java doc as Jay suggested.
> 
> 
> Go back to the AtomicInteger approach for less dependency.
> 
> 
> Rebased on trunk
> 
> 
> Add some missing code from rebase.
> 
> 
> Diffs
> -
> 
>   clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java 
> 42b12928781463b56fc4a45d96bb4da2745b6d95 
>   clients/src/main/java/org/apache/kafka/clients/producer/MockProducer.java 
> 6913090af03a455452b0b5c3df78f266126b3854 
>   clients/src/main/java/org/apache/kafka/clients/producer/Producer.java 
> 5b3e75ed83a940bc922f9eca10d4008d67aa37c9 
>   
> clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java
>  49a98838767615dd952da20825f6985698137710 
>   
> clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java 
> b2db91ca14bbd17fef5ce85839679144fff3f689 
>   
> clients/src/main/java/org/apache/kafka/common/errors/InterruptException.java 
> fee322fa0dd9704374db4a6964246a7d2918d3e4 
>   clients/src/main/java/org/apache/kafka/common/serialization/Serializer.java 
> c44054038066f0d0829d05f082b2ee42b34cded7 
>   
> clients/src/test/java/org/apache/kafka/clients/producer/internals/RecordAccumulatorTest.java
>  baa48e7c1b7ac5da8f3aca29f653c3fff88f8009 
>   core/src/test/scala/integration/kafka/api/ProducerSendTest.scala 
> 9811a2b2b1e9bf1beb301138f7626e12d275a8db 
> 
> Diff: https://reviews.apache.org/r/31850/diff/
> 
> 
> Testing
> ---
> 
> Unit tests passed.
> 
> 
> Thanks,
> 
> Jiangjie Qin
> 
>



Re: [DISCUSS] KIP-21 Configuration Management

2015-05-12 Thread Joel Koshy
The lack of audit could be addressed to some degree by an internal
__config_changes topic which can have very long retention. Also, per
the hangout summary that Gwen sent out it appears that we decided
against supporting SIGHUP/dynamic configs for the broker.

Thanks,

Joel

On Tue, May 12, 2015 at 11:05:06AM -0700, Neha Narkhede wrote:
> Thanks for chiming in, Todd!
> 
> Agree that the lack of audit and rollback is a major downside of moving all
> configs to ZooKeeper. Being able to configure dynamically created entities
> in Kafka is required though. So I think what Todd suggested is a good
> solution to managing all configs - catching SIGHUP for broker configs and
> storing dynamic configs in ZK like we do today.
> 
> On Tue, May 12, 2015 at 10:30 AM, Jay Kreps  wrote:
> 
> > Hmm, here is how I think we can change the split brain proposal to make it
> > a bit better:
> > 1. Get rid of broker overrides, this is just done in the config file. This
> > makes the precedence chain a lot clearer (e.g. zk always overrides file on
> > a per-entity basis).
> > 2. Get rid of the notion of dynamic configs in ConfigDef and in the broker.
> > All overrides are dynamic and all server configs are static.
> > 3. Create an equivalent of LogConfig for ClientConfig and any future config
> > type we make.
> > 4. Generalize the TopicConfigManager to handle multiple types of overrides.
> >
> > What we haven't done is try to think through how the pure zk approach would
> > work.
> >
> > -Jay
> >
> >
> >
> > On Mon, May 11, 2015 at 10:53 PM, Ashish Singh 
> > wrote:
> >
> > > I agree with the Joel's suggestion on keeping broker's configs in
> > > config file and clients/topics config in ZK. Few other projects, Apache
> > > Solr for one, also does something similar for its configurations.
> > >
> > > On Monday, May 11, 2015, Gwen Shapira  wrote:
> > >
> > > > I like this approach (obviously).
> > > > I am also OK with supporting broker re-read of config file based on ZK
> > > > watch instead of SIGHUP, if we see this as more consistent with the
> > rest
> > > of
> > > > our code base.
> > > >
> > > > Either is fine by me as long as brokers keep the file and just do
> > refresh
> > > > :)
> > > >
> > > > On Tue, May 12, 2015 at 2:54 AM, Joel Koshy  > > > > wrote:
> > > >
> > > > > So the general concern here is the dichotomy of configs (which we
> > > > > already have - i.e., in the form of broker config files vs topic
> > > > > configs in zookeeper). We (at LinkedIn) had some discussions on this
> > > > > last week and had this very question for the operations team whose
> > > > > opinion is I think to a large degree a touchstone for this decision:
> > > > > "Has the operations team at LinkedIn experienced any pain so far with
> > > > > managing topic configs in ZooKeeper (while broker configs are
> > > > > file-based)?" It turns out that ops overwhelmingly favors the current
> > > > > approach. i.e., service configs as file-based configs and
> > client/topic
> > > > > configs in ZooKeeper is intuitive and works great. This may be
> > > > > somewhat counter-intuitive to devs, but this is one of those
> > decisions
> > > > > for which ops input is very critical - because for all practical
> > > > > purposes, they are the users in this discussion.
> > > > >
> > > > > If we continue with this dichotomy and need to support dynamic config
> > > > > for client/topic configs as well as select service configs then there
> > > > > will need to be dichotomy in the config change mechanism as well.
> > > > > i.e., client/topic configs will change via (say) a ZooKeeper watch
> > and
> > > > > the service config will change via a config file re-read (on SIGHUP)
> > > > > after config changes have been pushed out to local files. Is this a
> > > > > bad thing? Personally, I don't think it is - i.e. I'm in favor of
> > this
> > > > > approach. What do others think?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Joel
> > > > >
> > > > > On Mon, May 11, 2015 at 11:08:44PM +0300, Gwen Shapira wrote:
> > > > > > What Todd said :)
> > > > > >
> > > > > > (I think my ops background is showing...)
> > > > > >
> > > > > > On Mon, May 11, 2015 at 10:17 PM, Todd Palino  > > > > wrote:
> > > > > >
> > > > > > > I understand your point here, Jay, but I disagree that we can't
> > > have
> > > > > two
> > > > > > > configuration systems. We have two different types of
> > configuration
> > > > > > > information. We have configuration that relates to the service
> > > itself
> > > > > (the
> > > > > > > Kafka broker), and we have configuration that relates to the
> > > content
> > > > > within
> > > > > > > the service (topics). I would put the client configuration
> > (quotas)
> > > > in
> > > > > the
> > > > > > > with the second part, as it is dynamic information. I just don't
> > > see
> > > > a
> > > > > good
> > > > > > > argument for effectively degrading the configuration for the
> > > service
> > > > > > > because of trying to keep it paired with the

[jira] [Commented] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540596#comment-14540596
 ] 

Jun Rao commented on KAFKA-2147:


1. How many producer requests/sec are there on each broker?

2. In the 30 sec gap, were there any other entries in the log?

4. Also, do you see any ZK session expiration in the log?


> Unbalanced replication can cause extreme purgatory growth
> -
>
> Key: KAFKA-2147
> URL: https://issues.apache.org/jira/browse/KAFKA-2147
> Project: Kafka
>  Issue Type: Bug
>  Components: purgatory, replication
>Affects Versions: 0.8.2.1
>Reporter: Evan Huus
>Assignee: Joel Koshy
>
> Apologies in advance, this is going to be a bit of a complex description, 
> mainly because we've seen this issue in several different ways and we're still 
> tying them together in terms of root cause and analysis.
> It is worth noting now that we have all our producers set up to send 
> RequiredAcks==-1, and that this includes all our MirrorMakers.
> I understand the purgatory is being rewritten (again) for 0.8.3. Hopefully 
> that will incidentally fix this issue, or at least render it moot.
> h4. Symptoms
> Fetch request purgatory on a broker or brokers grows rapidly and steadily at 
> a rate of roughly 1-5K requests per second. Heap memory used also grows to 
> keep pace. When 4-5 million requests have accumulated in purgatory, the 
> purgatory is drained, causing a substantial latency spike. The node will tend 
> to drop leadership, replicate, and recover.
> h5. Case 1 - MirrorMaker
> We first noticed this case when enabling mirrormaker. We had one primary 
> cluster already, with many producers and consumers. We created a second, 
> identical cluster and enabled replication from the original to the new 
> cluster on some topics using mirrormaker. This caused all six nodes in the 
> new cluster to exhibit the symptom in lockstep - their purgatories would all 
> grow together, and get drained within about 20 seconds of each other. The 
> cluster-wide latency spikes at this time caused several problems for us.
> Turning MM on and off turned the problem on and off very precisely. When we 
> stopped MM, the purgatories would all drop to normal levels immediately, and 
> would start climbing again when we restarted it.
> Note that this is the *fetch* purgatories on the brokers that MM was 
> *producing* to, which indicates fairly strongly that this is a replication 
> issue, not a MM issue.
> This particular cluster and MM setup was abandoned for other reasons before 
> we could make much progress debugging.
> h5. Case 2 - Broker 6
> The second time we saw this issue was on the newest broker (broker 6) in the 
> original cluster. For a long time we were running with five nodes, and 
> eventually added a sixth to handle the increased load. At first, we moved 
> only a handful of higher-volume partitions to this broker. Later, we created 
> a group of new topics (totalling around 100 partitions) for testing purposes 
> that were spread automatically across all six nodes. These topics saw 
> occasional traffic, but were generally unused. At this point broker 6 had 
> leadership for about an equal number of high-volume and unused partitions, 
> about 15-20 of each.
> Around this time (we don't have detailed enough data to prove real 
> correlation unfortunately), the issue started appearing on this broker as 
> well, but not on any of the other brokers in the cluster.
> h4. Debugging
> The first thing we tried was to reduce the 
> `fetch.purgatory.purge.interval.requests` from the default of 1000 to a much 
> lower value of 200. This had no noticeable effect at all.
> We then enabled debug logging on broker06 and started looking through that. I 
> can attach complete log samples if necessary, but the thing that stood out 
> for us was a substantial number of the following lines:
> {noformat}
> [2015-04-23 20:05:15,196] DEBUG [KafkaApi-6] Putting fetch request with 
> correlation id 49939 from client ReplicaFetcherThread-0-6 into purgatory 
> (kafka.server.KafkaApis)
> {noformat}
> The volume of these lines seemed to match (approximately) the fetch purgatory 
> growth on that broker.
> At this point we developed a hypothesis (detailed below) which guided our 
> subsequent debugging tests:
> - Setting a daemon up to produce regular random data to all of the topics led 
> by kafka06 (specifically the ones which otherwise would receive no data) 
> substantially alleviated the problem.
> - Doing an additional rebalance of the cluster in order to move a number of 
> other topics with regular data to kafka06 appears to have solved the problem 
> completely.
> h4. Hypothesis
> Current versions (0.8.2.1 and earlier) have issues with the replica fetcher 
> not backing off correctly

Re: Review Request 33378: Patch for KAFKA-2136

2015-05-12 Thread Dong Lin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33378/#review83439
---



core/src/main/scala/kafka/server/ReplicaManager.scala


Will readFromLocalLog() read data into memory before the client's quota is 
checked?


- Dong Lin


On May 11, 2015, 9:51 p.m., Aditya Auradkar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33378/
> ---
> 
> (Updated May 11, 2015, 9:51 p.m.)
> 
> 
> Review request for kafka, Joel Koshy and Jun Rao.
> 
> 
> Bugs: KAFKA-2136
> https://issues.apache.org/jira/browse/KAFKA-2136
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> Fixing bug
> 
> 
> Diffs
> -
> 
>   
> clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java
>  ef9dd5238fbc771496029866ece1d85db6d7b7a5 
>   
> clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java 
> b2db91ca14bbd17fef5ce85839679144fff3f689 
>   clients/src/main/java/org/apache/kafka/common/protocol/Protocol.java 
> 3dc8b015afd2347a41c9a9dbc02b8e367da5f75f 
>   clients/src/main/java/org/apache/kafka/common/requests/FetchRequest.java 
> 8686d83aa52e435c6adafbe9ff4bd1602281072a 
>   clients/src/main/java/org/apache/kafka/common/requests/FetchResponse.java 
> eb8951fba48c335095cc43fc3672de1c733e07ff 
>   clients/src/main/java/org/apache/kafka/common/requests/ProduceRequest.java 
> fabeae3083a8ea55cdacbb9568f3847ccd85bab4 
>   clients/src/main/java/org/apache/kafka/common/requests/ProduceResponse.java 
> 37ec0b79beafcf5735c386b066eb319fb697eff5 
>   
> clients/src/test/java/org/apache/kafka/clients/consumer/internals/FetcherTest.java
>  419541011d652becf0cda7a5e62ce813cddb1732 
>   
> clients/src/test/java/org/apache/kafka/clients/producer/internals/SenderTest.java
>  8b1805d3d2bcb9fe2bacb37d870c3236aa9532c4 
>   
> clients/src/test/java/org/apache/kafka/common/requests/RequestResponseTest.java
>  e3cc1967e407b64cc734548c19e30de700b64ba8 
>   core/src/main/scala/kafka/api/FetchRequest.scala 
> b038c15186c0cbcc65b59479324052498361b717 
>   core/src/main/scala/kafka/api/FetchResponse.scala 
> 75aaf57fb76ec01660d93701a57ae953d877d81c 
>   core/src/main/scala/kafka/api/ProducerRequest.scala 
> 570b2da1d865086f9830aa919a49063abbbe574d 
>   core/src/main/scala/kafka/api/ProducerResponse.scala 
> 5d1fac4cb8943f5bfaa487f8e9d9d2856efbd330 
>   core/src/main/scala/kafka/consumer/SimpleConsumer.scala 
> 31a2639477bf66f9a05d2b9b07794572d7ec393b 
>   core/src/main/scala/kafka/server/AbstractFetcherThread.scala 
> a439046e118b6efcc3a5a9d9e8acb79f85e40398 
>   core/src/main/scala/kafka/server/DelayedFetch.scala 
> de6cf5bdaa0e70394162febc63b50b55ca0a92db 
>   core/src/main/scala/kafka/server/DelayedProduce.scala 
> 05078b24ef28f2f4e099afa943e43f1d00359fda 
>   core/src/main/scala/kafka/server/KafkaApis.scala 
> 417960dd1ab407ebebad8fdb0e97415db3e91a2f 
>   core/src/main/scala/kafka/server/OffsetManager.scala 
> 18680ce100f10035175cc0263ba7787ab0f6a17a 
>   core/src/main/scala/kafka/server/ReplicaFetcherThread.scala 
> b31b432a226ba79546dd22ef1d2acbb439c2e9a3 
>   core/src/main/scala/kafka/server/ReplicaManager.scala 
> 59c9bc3ac3a8afc07a6f8c88c5871304db588d17 
>   core/src/test/scala/unit/kafka/api/RequestResponseSerializationTest.scala 
> 5717165f2344823fabe8f7cfafae4bb8af2d949a 
>   core/src/test/scala/unit/kafka/server/DelayedOperationTest.scala 
> f3ab3f4ff8eb1aa6b2ab87ba75f72eceb6649620 
>   core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala 
> 00d59337a99ac135e8689bd1ecd928f7b1423d79 
> 
> Diff: https://reviews.apache.org/r/33378/diff/
> 
> 
> Testing
> ---
> 
> New tests added
> 
> 
> Thanks,
> 
> Aditya Auradkar
> 
>



Re: Review Request 33049: Patch for KAFKA-2084

2015-05-12 Thread Dong Lin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33049/#review83441
---



clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java


metric.value(timeMs), which translates to Rate.measure(config, timeMs), may 
return Infinity. This may introduce an infinite delay.

I think this bug is rooted either in class Rate or in class SampledStat. In 
short, SampledStat.purgeObsoleteSamples may remove all samples, leaving an 
oldest sample whose window starts at now, so the elapsed time used by 
Rate.measure is zero.

Should I open a ticket and submit a patch for it?
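
To make the failure mode concrete, here is a minimal standalone sketch 
(simplified names, not the actual org.apache.kafka.common.metrics classes) of 
how an all-purged sample set leaves an oldest window starting at now, so the 
elapsed time is zero and the computed rate comes out as Infinity:

// Simplified illustration of the Rate/SampledStat interaction described above.
// All names here are stand-ins for the sketch, not the real metrics classes.
public class RateInfinitySketch {
    static class Sample {
        final long lastWindowMs; // start of the sample window
        double value;            // value accumulated in this window
        Sample(long lastWindowMs) { this.lastWindowMs = lastWindowMs; }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();

        // After purgeObsoleteSamples() has discarded every old sample, a fresh
        // sample is created whose window starts at 'now'.
        Sample oldest = new Sample(now);
        oldest.value = 42.0;

        double elapsedSeconds = (now - oldest.lastWindowMs) / 1000.0; // 0.0
        double rate = oldest.value / elapsedSeconds;                  // 42.0 / 0.0

        System.out.println(rate); // prints Infinity
    }
}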



clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java


The function quits on the first quota violation and calculates the delayTime 
based on that violation.

Should we enumerate all metrics in this.metrics() and set the delayTime of the 
QuotaViolationException to the largest delayTime across all violations?
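
For illustration, a self-contained sketch of that idea with simplified 
stand-in types (the per-violation delay formula below is made up for the 
example; these are not the real metrics classes):

import java.util.ArrayList;
import java.util.List;

// Check every metric against its quota and report the *largest* delay among
// all violations, rather than the delay of whichever violation is hit first.
public class MaxDelaySketch {

    static class MetricSnapshot {
        final String name;
        final double value; // measured value at timeMs
        final double bound; // quota upper bound
        MetricSnapshot(String name, double value, double bound) {
            this.name = name; this.value = value; this.bound = bound;
        }
        // Hypothetical per-violation delay: how long to throttle so that the
        // measured value falls back under the bound.
        double delayMs(long windowMs) {
            return value <= bound ? 0.0 : (value - bound) / bound * windowMs;
        }
    }

    public static void main(String[] args) {
        long windowMs = 30000;
        List<MetricSnapshot> metrics = new ArrayList<MetricSnapshot>();
        metrics.add(new MetricSnapshot("producer-byte-rate", 1200000, 1000000));
        metrics.add(new MetricSnapshot("fetch-byte-rate",    3000000, 1000000));

        double maxDelayMs = 0.0;
        String worst = null;
        for (MetricSnapshot m : metrics) {
            double d = m.delayMs(windowMs);
            if (d > maxDelayMs) { maxDelayMs = d; worst = m.name; }
        }
        // A QuotaViolationException built here would carry the largest delay.
        System.out.println("worst violation: " + worst + ", delay " + maxDelayMs + " ms");
    }
}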


- Dong Lin


On May 11, 2015, 11:17 p.m., Aditya Auradkar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33049/
> ---
> 
> (Updated May 11, 2015, 11:17 p.m.)
> 
> 
> Review request for kafka, Joel Koshy and Jun Rao.
> 
> 
> Bugs: KAFKA-2084
> https://issues.apache.org/jira/browse/KAFKA-2084
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> Updated patch for quotas. This patch does the following: 
> 1. Add per-client metrics for both producer and consumers 
> 2. Add configuration for quotas 
> 3. Compute delay times in the metrics package and return the delay times in 
> QuotaViolationException 
> 4. Add a DelayQueue in KafkaApi's that can be used to throttle any type of 
> request. Implemented request throttling for produce and fetch requests. 
> 5. Added unit and integration test cases.
> 6. This doesn't include a system test. There is a separate ticket for that
> 
> 
> Diffs
> -
> 
>   clients/src/main/java/org/apache/kafka/common/metrics/MetricConfig.java 
> dfa1b0a11042ad9d127226f0e0cec8b1d42b8441 
>   clients/src/main/java/org/apache/kafka/common/metrics/Quota.java 
> d82bb0c055e631425bc1ebbc7d387baac76aeeaa 
>   
> clients/src/main/java/org/apache/kafka/common/metrics/QuotaViolationException.java
>  a451e5385c9eca76b38b425e8ac856b2715fcffe 
>   clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java 
> ca823fd4639523018311b814fde69b6177e73b97 
>   clients/src/test/java/org/apache/kafka/common/utils/MockTime.java  
>   core/src/main/scala/kafka/server/ClientQuotaMetrics.scala PRE-CREATION 
>   core/src/main/scala/kafka/server/KafkaApis.scala 
> 417960dd1ab407ebebad8fdb0e97415db3e91a2f 
>   core/src/main/scala/kafka/server/KafkaConfig.scala 
> 9efa15ca5567b295ab412ee9eea7c03eb4cdc18b 
>   core/src/main/scala/kafka/server/KafkaServer.scala 
> b7d2a2842e17411a823b93bdedc84657cbd62be1 
>   core/src/main/scala/kafka/server/ReplicaManager.scala 
> 59c9bc3ac3a8afc07a6f8c88c5871304db588d17 
>   core/src/main/scala/kafka/server/ThrottledRequest.scala PRE-CREATION 
>   core/src/main/scala/kafka/utils/ShutdownableThread.scala 
> fc226c863095b7761290292cd8755cd7ad0f155c 
>   core/src/test/scala/integration/kafka/api/QuotasTest.scala PRE-CREATION 
>   core/src/test/scala/unit/kafka/server/ClientQuotaMetricsTest.scala 
> PRE-CREATION 
>   core/src/test/scala/unit/kafka/server/KafkaConfigConfigDefTest.scala 
> 8014a5a6c362785539f24eb03d77278434614fe6 
>   core/src/test/scala/unit/kafka/server/ThrottledRequestExpirationTest.scala 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/33049/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Aditya Auradkar
> 
>



[jira] [Commented] (KAFKA-1554) Corrupt index found on clean startup

2015-05-12 Thread Steve Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540520#comment-14540520
 ] 

Steve Miller commented on KAFKA-1554:
-

The foo-.old thing was just a way I knew I could reproduce the failure.

I'm suggesting that if Kafka ever restarts and has an issue on its way up, and 
it exits before it's fully up and running, it'll leave a ton of corrupted index 
files as a result.  That is, anything strange that happens in someone's 
environment, that causes a restart that isn't happy, will ultimately cause 
index corruption of this sort.

Maybe the original poster didn't end up in this state because of an unexpected 
restart of the sort I'm describing, but the current behavior seems worth 
fixing.  I'm betting there was an unexpected startup issue in that person's 
case but I admit I might be wrong. :)

Did that help?

> Corrupt index found on clean startup
> 
>
> Key: KAFKA-1554
> URL: https://issues.apache.org/jira/browse/KAFKA-1554
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1
> Environment: ubuntu 12.04, oracle jdk 1.7
>Reporter: Alexis Midon
>Assignee: Mayuresh Gharat
>Priority: Critical
> Fix For: 0.9.0
>
> Attachments: KAFKA-1554.patch
>
>
> On a clean start up, corrupted index files are found.
> After investigations, it appears that some pre-allocated index files are not 
> "compacted" correctly and the end of the file is full of zeroes.
> As a result, on start up, the last relative offset is zero which yields an 
> offset equal to the base offset.
> The workaround is to delete all index files of size 10MB (the size of the 
> pre-allocated files), and restart. Index files will be re-created.
> {code}
> find $your_data_directory -size 10485760c -name *.index #-delete
> {code}
> This is issue might be related/similar to 
> https://issues.apache.org/jira/browse/KAFKA-1112
> {code}
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,696 
> INFO main kafka.server.KafkaServer.info - [Kafka Server 847605514], starting
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,698 
> INFO main kafka.server.KafkaServer.info - [Kafka Server 847605514], 
> Connecting to zookeeper on 
> zk-main0.XXX:2181,zk-main1.XXX:2181,zk-main2.:2181/production/kafka/main
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,708 
> INFO 
> ZkClient-EventThread-14-zk-main0.XXX.com:2181,zk-main1.XXX.com:2181,zk-main2.XXX.com:2181,zk-main3.XXX.com:2181,zk-main4.XXX.com:2181/production/kafka/main
>  org.I0Itec.zkclient.ZkEventThread.run - Starting ZkClient event thread.
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:host.name=i-6b948138.inst.aws.airbnb.com
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.version=1.7.0_55
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.vendor=Oracle Corporation
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.home=/usr/lib/jvm/jre-7-oracle-x64/jre
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.class.path=libs/snappy-java-1.0.5.jar:libs/scala-library-2.10.1.jar:libs/slf4j-api-1.7.2.jar:libs/jopt-simple-3.2.jar:libs/metrics-annotation-2.2.0.jar:libs/log4j-1.2.15.jar:libs/kafka_2.10-0.8.1.jar:libs/zkclient-0.3.jar:libs/zookeeper-3.3.4.jar:libs/metrics-core-2.2.0.jar
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.io.tmpdir=/tmp
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.compiler=
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:os.name=Linu

Re: [DISCUSS] KIP-19 Add a request timeout to NetworkClient

2015-05-12 Thread Ewen Cheslack-Postava
I think my confusion is coming from this:

> So in this KIP, we only address (3). The only public interface change is a
> new configuration of request timeout (and maybe change the configuration
> name of TIMEOUT_CONFIG to REPLICATION_TIMEOUT_CONFIG).

There are 3 possible compatibility issues I see here:

* I assumed this meant the constants also change, so "timeout.ms" becomes "
replication.timeout.ms". This breaks config files that worked on the
previous version and the only warning would be in release notes. We do warn
about unused configs so they might notice the problem.

* Binary and source compatibility if someone configures their client in
code and uses the TIMEOUT_CONFIG variable. Renaming it will cause existing
jars to break if you try to run against an updated client (which seems not
very significant since I doubt people upgrade these without recompiling but
maybe I'm wrong about that). And it breaks builds without have deprecated
that field first, which again, is probably not the biggest issue but is
annoying for users and when we accidentally changed the API we received a
complaint about breaking builds.

* Behavior compatibility as Jay mentioned on the call -- setting the config
(even if the name changed) doesn't have the same effect it used to.

One solution, which admittedly is more painful to implement and maintain,
would be to maintain the timeout.ms config, have it override the others if
it is specified (including an infinite request timeout I guess?), and if it
isn't specified, we can just use the new config variables. Given a real
deprecation schedule, users would have better warning of changes and a
window to make the changes.
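
For illustration, a minimal sketch of that fallback logic (assuming the old
name stays "timeout.ms" and the new one becomes "request.timeout.ms"; the
exact names are still up for discussion):

import java.util.Properties;

// If the legacy "timeout.ms" is set it wins; otherwise the new, more specific
// config applies. A sketch only, not the actual ProducerConfig handling.
public class TimeoutFallbackSketch {
    static long effectiveRequestTimeoutMs(Properties props) {
        String legacy = props.getProperty("timeout.ms");
        if (legacy != null) {
            System.err.println("timeout.ms is deprecated; use request.timeout.ms");
            return Long.parseLong(legacy);
        }
        return Long.parseLong(props.getProperty("request.timeout.ms", "30000"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("timeout.ms", "10000"); // old-style config still honored
        System.out.println(effectiveRequestTimeoutMs(props)); // 10000
    }
}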

I actually think it might not be necessary to maintain the old behavior
precisely, although maybe for some code it is an issue if they start seeing
timeout exceptions that they wouldn't have seen before?

-Ewen

On Wed, May 6, 2015 at 6:06 PM, Jun Rao  wrote:

> Jiangjie,
>
> Yes, I think using metadata timeout to expire batches in the record
> accumulator makes sense.
>
> Thanks,
>
> Jun
>
> On Mon, May 4, 2015 at 10:32 AM, Jiangjie Qin 
> wrote:
>
> > I incorporated Ewen and Guozhang’s comments in the KIP page. Want to
> speed
> > up on this KIP because currently mirror maker is very likely to hang
> > when a broker is down.
> >
> > I also took a shot to solve KAFKA-1788 in KAFKA-2142. I used metadata
> > timeout to expire the batches which are sitting in accumulator without
> > leader info. I did that because the situation there is essentially
> missing
> > metadata.
> >
> > As a summary of what I am thinking about the timeout in new Producer:
> >
> > 1. Metadata timeout:
> >   - used in send(), blocking
> >   - used in accumulator to expire batches with timeout exception.
> > 2. Linger.ms
> >   - Used in accumulator to ready the batch for drain
> > 3. Request timeout
> >   - Used in NetworkClient to expire a batch and retry if no response is
> > received for a request before timeout.
> >
> > So in this KIP, we only address (3). The only public interface change is
> a
> > new configuration of request timeout (and maybe change the configuration
> > name of TIMEOUT_CONFIG to REPLICATION_TIMEOUT_CONFIG).
> >
> > Would like to see what people think of above approach?
> >
> > Jiangjie (Becket) Qin
> >
> > On 4/20/15, 6:02 PM, "Jiangjie Qin"  wrote:
> >
> > >Jun,
> > >
> > >I thought a little bit differently on this.
> > >Intuitively, I am thinking that if a partition is offline, the metadata
> > >for that partition should be considered not ready because we don’t know
> > >which broker we should send the message to. So those sends need to be
> > >blocked on metadata timeout.
> > >Another thing I’m wondering is in which scenario an offline partition
> will
> > >become online again in a short period of time and how likely it will
> > >occur. My understanding is that the batch timeout for batches sitting in
> > >accumulator should be larger than linger.ms but should not be too long
> > >(e.g. less than 60 seconds). Otherwise it will exhaust the shared buffer
> > >with batches to be aborted.
> > >
> > >That said, I do agree it is reasonable to buffer the message for some
> time
> > >so messages to other partitions can still get sent. But adding another
> > >expiration in addition to linger.ms - which is essentially a timeout -
> > >sounds a little bit confusing. Maybe we can do this, let the batch sit
> in
> > >accumulator up to linger.ms, then fail it if necessary.
> > >
> > >What do you think?
> > >
> > >Thanks,
> > >
> > >Jiangjie (Becket) Qin
> > >
> > >On 4/20/15, 1:11 PM, "Jun Rao"  wrote:
> > >
> > >>Jiangjie,
> > >>
> > >>Allowing messages to be accumulated in an offline partition could be
> > >>useful
> > >>since the partition may become available before the request timeout or
> > >>linger time is reached. Now that we are planning to add a new timeout,
> it
> > >>would be useful to think through whether/how that applies to messages
> in
> > >>the acc

KIP Hangout notes

2015-05-12 Thread Gwen Shapira
My notes from the hangout:

* KIP-11: Based on feedback from Joel - group privileges will become READ.
Parth will open a new thread with a summary and re-open the vote.

* Updated that I’m still working on KAFKA-1928, Jay clarified that some of
the assertions may no longer be valid, but we should add the ones that make
sense

* Harsha asked for review of SSL code, Jun will pick it up.

* KIP-19: Even though using the metadata timeout for requests is confusing, it
makes sense to use it for data still sitting in the accumulator when the
leader is not available. The request timeout will be used for data out of the
accumulator. Configuration should tie the replication timeout and the request
timeout together by default, to simplify. We should also validate that the
request timeout is longer than the replication timeout (a small validation
sketch follows these notes). We need reasonable default values as well. Long
discussion on names of parameters. Guozhang will summarize the issues
identified - mainly around retries.

* KIP-21: We will go with hybrid approach - server configuration will
remain in files and static (i.e. no SIGHUP), dynamic entities (topic,
clients, users, whatever) can have overrides in ZK. Aditya will update KIP
with details.
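
For the KIP-19 point above, a tiny sketch of the timeout validation (the
parameter names are placeholders until the KIP settles them):

// Fail fast at startup if the request timeout is not strictly longer than the
// replication timeout. Names are placeholders, not final config names.
public class TimeoutValidationSketch {
    static void validate(long requestTimeoutMs, long replicationTimeoutMs) {
        if (requestTimeoutMs <= replicationTimeoutMs)
            throw new IllegalArgumentException("request timeout (" + requestTimeoutMs
                    + " ms) must be larger than replication timeout ("
                    + replicationTimeoutMs + " ms)");
    }

    public static void main(String[] args) {
        validate(40000, 30000); // ok
        validate(20000, 30000); // throws IllegalArgumentException
    }
}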

As usual, correct me if I got it wrong!

Gwen


[jira] [Commented] (KAFKA-1554) Corrupt index found on clean startup

2015-05-12 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540494#comment-14540494
 ] 

Joel Koshy commented on KAFKA-1554:
---

Thanks Steve - sorry I misread your earlier comment. The issue that you are 
raising is different - i.e., renaming a directory to foo-.old (for example) - 
yes, that is a restriction. You have to delete the corrupt index to get past 
this (and you presumably don't want to lose the actual log segments with the 
data anyway).

We are looking for a way to reproduce the original index corruption or a 
reasonable hypothesis on how it happens.


> Corrupt index found on clean startup
> 
>
> Key: KAFKA-1554
> URL: https://issues.apache.org/jira/browse/KAFKA-1554
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1
> Environment: ubuntu 12.04, oracle jdk 1.7
>Reporter: Alexis Midon
>Assignee: Mayuresh Gharat
>Priority: Critical
> Fix For: 0.9.0
>
> Attachments: KAFKA-1554.patch
>
>
> On a clean start up, corrupted index files are found.
> After investigations, it appears that some pre-allocated index files are not 
> "compacted" correctly and the end of the file is full of zeroes.
> As a result, on start up, the last relative offset is zero which yields an 
> offset equal to the base offset.
> The workaround is to delete all index files of size 10MB (the size of the 
> pre-allocated files), and restart. Index files will be re-created.
> {code}
> find $your_data_directory -size 10485760c -name *.index #-delete
> {code}
> This is issue might be related/similar to 
> https://issues.apache.org/jira/browse/KAFKA-1112
> {code}
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,696 
> INFO main kafka.server.KafkaServer.info - [Kafka Server 847605514], starting
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,698 
> INFO main kafka.server.KafkaServer.info - [Kafka Server 847605514], 
> Connecting to zookeeper on 
> zk-main0.XXX:2181,zk-main1.XXX:2181,zk-main2.:2181/production/kafka/main
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,708 
> INFO 
> ZkClient-EventThread-14-zk-main0.XXX.com:2181,zk-main1.XXX.com:2181,zk-main2.XXX.com:2181,zk-main3.XXX.com:2181,zk-main4.XXX.com:2181/production/kafka/main
>  org.I0Itec.zkclient.ZkEventThread.run - Starting ZkClient event thread.
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:host.name=i-6b948138.inst.aws.airbnb.com
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.version=1.7.0_55
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.vendor=Oracle Corporation
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.home=/usr/lib/jvm/jre-7-oracle-x64/jre
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.class.path=libs/snappy-java-1.0.5.jar:libs/scala-library-2.10.1.jar:libs/slf4j-api-1.7.2.jar:libs/jopt-simple-3.2.jar:libs/metrics-annotation-2.2.0.jar:libs/log4j-1.2.15.jar:libs/kafka_2.10-0.8.1.jar:libs/zkclient-0.3.jar:libs/zookeeper-3.3.4.jar:libs/metrics-core-2.2.0.jar
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.io.tmpdir=/tmp
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.compiler=
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:os.name=Linux
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:os.arch=amd64
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,717 
> INFO main org.ap

Re: [DISCUSS] KIP-21 Configuration Management

2015-05-12 Thread Neha Narkhede
Thanks for chiming in, Todd!

Agree that the lack of audit and rollback is a major downside of moving all
configs to ZooKeeper. Being able to configure dynamically created entities
in Kafka is required though. So I think what Todd suggested is a good
solution to managing all configs - catching SIGHUP for broker configs and
storing dynamic configs in ZK like we do today.

On Tue, May 12, 2015 at 10:30 AM, Jay Kreps  wrote:

> Hmm, here is how I think we can change the split brain proposal to make it
> a bit better:
> 1. Get rid of broker overrides, this is just done in the config file. This
> makes the precedence chain a lot clearer (e.g. zk always overrides file on
> a per-entity basis).
> 2. Get rid of the notion of dynamic configs in ConfigDef and in the broker.
> All overrides are dynamic and all server configs are static.
> 3. Create an equivalent of LogConfig for ClientConfig and any future config
> type we make.
> 4. Generalize the TopicConfigManager to handle multiple types of overrides.
>
> What we haven't done is try to think through how the pure zk approach would
> work.
>
> -Jay
>
>
>
> On Mon, May 11, 2015 at 10:53 PM, Ashish Singh 
> wrote:
>
> > I agree with the Joel's suggestion on keeping broker's configs in
> > config file and clients/topics config in ZK. Few other projects, Apache
> > Solr for one, also does something similar for its configurations.
> >
> > On Monday, May 11, 2015, Gwen Shapira  wrote:
> >
> > > I like this approach (obviously).
> > > I am also OK with supporting broker re-read of config file based on ZK
> > > watch instead of SIGHUP, if we see this as more consistent with the
> rest
> > of
> > > our code base.
> > >
> > > Either is fine by me as long as brokers keep the file and just do
> refresh
> > > :)
> > >
> > > On Tue, May 12, 2015 at 2:54 AM, Joel Koshy  > > > wrote:
> > >
> > > > So the general concern here is the dichotomy of configs (which we
> > > > already have - i.e., in the form of broker config files vs topic
> > > > configs in zookeeper). We (at LinkedIn) had some discussions on this
> > > > last week and had this very question for the operations team whose
> > > > opinion is I think to a large degree a touchstone for this decision:
> > > > "Has the operations team at LinkedIn experienced any pain so far with
> > > > managing topic configs in ZooKeeper (while broker configs are
> > > > file-based)?" It turns out that ops overwhelmingly favors the current
> > > > approach. i.e., service configs as file-based configs and
> client/topic
> > > > configs in ZooKeeper is intuitive and works great. This may be
> > > > somewhat counter-intuitive to devs, but this is one of those
> decisions
> > > > for which ops input is very critical - because for all practical
> > > > purposes, they are the users in this discussion.
> > > >
> > > > If we continue with this dichotomy and need to support dynamic config
> > > > for client/topic configs as well as select service configs then there
> > > > will need to be dichotomy in the config change mechanism as well.
> > > > i.e., client/topic configs will change via (say) a ZooKeeper watch
> and
> > > > the service config will change via a config file re-read (on SIGHUP)
> > > > after config changes have been pushed out to local files. Is this a
> > > > bad thing? Personally, I don't think it is - i.e. I'm in favor of
> this
> > > > approach. What do others think?
> > > >
> > > > Thanks,
> > > >
> > > > Joel
> > > >
> > > > On Mon, May 11, 2015 at 11:08:44PM +0300, Gwen Shapira wrote:
> > > > > What Todd said :)
> > > > >
> > > > > (I think my ops background is showing...)
> > > > >
> > > > > On Mon, May 11, 2015 at 10:17 PM, Todd Palino  > > > wrote:
> > > > >
> > > > > > I understand your point here, Jay, but I disagree that we can't
> > have
> > > > two
> > > > > > configuration systems. We have two different types of
> configuration
> > > > > > information. We have configuration that relates to the service
> > itself
> > > > (the
> > > > > > Kafka broker), and we have configuration that relates to the
> > content
> > > > within
> > > > > > the service (topics). I would put the client configuration
> (quotas)
> > > in
> > > > the
> > > > > > with the second part, as it is dynamic information. I just don't
> > see
> > > a
> > > > good
> > > > > > argument for effectively degrading the configuration for the
> > service
> > > > > > because of trying to keep it paired with the configuration of
> > dynamic
> > > > > > resources.
> > > > > >
> > > > > > -Todd
> > > > > >
> > > > > > On Mon, May 11, 2015 at 11:33 AM, Jay Kreps  > > >
> > > > wrote:
> > > > > >
> > > > > > > I totally agree that ZK is not in-and-of-itself a configuration
> > > > > > management
> > > > > > > solution and it would be better if we could just keep all our
> > > config
> > > > in
> > > > > > > files. Anyone who has followed the various config discussions
> > over
> > > > the
> > > > > > past
> > > > > > > few years of discussion knows I'm the biggest pro

[jira] [Commented] (KAFKA-2189) Snappy compression of message batches less efficient in 0.8.2.1

2015-05-12 Thread Olson,Andrew (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540351#comment-14540351
 ] 

Olson,Andrew commented on KAFKA-2189:
-

I have verified that the issue [1] was introduced in snappy-java 1.1.1.2 and 
has already been fixed [2], in snappy-java 1.1.1.7.

[1] https://github.com/xerial/snappy-java/issues/100
[2] 
https://github.com/xerial/snappy-java/commit/dc2dd27f85e5167961883f71ac2681b73b33e5df

> Snappy compression of message batches less efficient in 0.8.2.1
> ---
>
> Key: KAFKA-2189
> URL: https://issues.apache.org/jira/browse/KAFKA-2189
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.8.2.1
>Reporter: Olson,Andrew
>Assignee: Jay Kreps
>
> We are using snappy compression and noticed a fairly substantial increase 
> (about 2.25x) in log filesystem space consumption after upgrading a Kafka 
> cluster from 0.8.1.1 to 0.8.2.1. We found that this is caused by messages 
> being seemingly recompressed individually (or possibly with a much smaller 
> buffer or dictionary?) instead of as a batch as sent by producers. We 
> eventually tracked down the change in compression ratio/scope to this [1] 
> commit that updated the snappy version from 1.0.5 to 1.1.1.3. The Kafka 
> client version does not appear to be relevant as we can reproduce this with 
> both the 0.8.1.1 and 0.8.2.1 Producer.
> Here are the log files from our troubleshooting that contain the same set of 
> 1000 messages, for batch sizes of 1, 10, 100, and 1000. f9d9b was the last 
> commit with 0.8.1.1-like behavior prior to f5ab8 introducing the issue.
> {noformat}
> -rw-rw-r-- 1 kafka kafka 404967 May 12 11:45 
> /var/kafka2/f9d9b-batch-1-0/.log
> -rw-rw-r-- 1 kafka kafka 119951 May 12 11:45 
> /var/kafka2/f9d9b-batch-10-0/.log
> -rw-rw-r-- 1 kafka kafka  89645 May 12 11:45 
> /var/kafka2/f9d9b-batch-100-0/.log
> -rw-rw-r-- 1 kafka kafka  88279 May 12 11:45 
> /var/kafka2/f9d9b-batch-1000-0/.log
> -rw-rw-r-- 1 kafka kafka 402837 May 12 11:41 
> /var/kafka2/f5ab8-batch-1-0/.log
> -rw-rw-r-- 1 kafka kafka 382437 May 12 11:41 
> /var/kafka2/f5ab8-batch-10-0/.log
> -rw-rw-r-- 1 kafka kafka 364791 May 12 11:41 
> /var/kafka2/f5ab8-batch-100-0/.log
> -rw-rw-r-- 1 kafka kafka 380693 May 12 11:41 
> /var/kafka2/f5ab8-batch-1000-0/.log
> {noformat}
> [1] 
> https://github.com/apache/kafka/commit/f5ab8e1780cf80f267906e3259ad4f9278c32d28
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-21 Configuration Management

2015-05-12 Thread Jay Kreps
Hmm, here is how I think we can change the split brain proposal to make it
a bit better:
1. Get rid of broker overrides, this is just done in the config file. This
makes the precedence chain a lot clearer (e.g. zk always overrides file on
a per-entity basis).
2. Get rid of the notion of dynamic configs in ConfigDef and in the broker.
All overrides are dynamic and all server configs are static.
3. Create an equivalent of LogConfig for ClientConfig and any future config
type we make.
4. Generalize the TopicConfigManager to handle multiple types of overrides.

What we haven't done is try to think through how the pure zk approach would
work.
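
To make point 4 a bit more concrete, here is a rough sketch of a generalized
override manager (ZooKeeper access is stubbed out with a map, and the
/config/<entityType>/<entityName> layout is just an assumption for the
example, not a proposal):

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// One override manager handling multiple entity types (topics, clients, ...)
// instead of a topic-only TopicConfigManager. ZooKeeper is faked with a map
// of znode path -> overrides so the sketch runs standalone.
public class OverrideManagerSketch {
    private final Map<String, Properties> fakeZk = new HashMap<String, Properties>();

    void putOverride(String entityType, String entityName, String key, String value) {
        String path = "/config/" + entityType + "/" + entityName;
        Properties p = fakeZk.get(path);
        if (p == null) {
            p = new Properties();
            fakeZk.put(path, p);
        }
        p.setProperty(key, value);
    }

    // Effective config = static file defaults overridden by the per-entity znode.
    Properties effectiveConfig(Properties fileDefaults, String entityType, String entityName) {
        Properties result = new Properties();
        result.putAll(fileDefaults);
        Properties overrides = fakeZk.get("/config/" + entityType + "/" + entityName);
        if (overrides != null)
            result.putAll(overrides);
        return result;
    }

    public static void main(String[] args) {
        OverrideManagerSketch mgr = new OverrideManagerSketch();
        Properties defaults = new Properties();
        defaults.setProperty("retention.ms", "604800000");

        mgr.putOverride("topics", "orders", "retention.ms", "86400000");
        System.out.println(mgr.effectiveConfig(defaults, "topics", "orders"));
        System.out.println(mgr.effectiveConfig(defaults, "clients", "mirror-maker"));
    }
}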

-Jay



On Mon, May 11, 2015 at 10:53 PM, Ashish Singh  wrote:

> I agree with the Joel's suggestion on keeping broker's configs in
> config file and clients/topics config in ZK. Few other projects, Apache
> Solr for one, also does something similar for its configurations.
>
> On Monday, May 11, 2015, Gwen Shapira  wrote:
>
> > I like this approach (obviously).
> > I am also OK with supporting broker re-read of config file based on ZK
> > watch instead of SIGHUP, if we see this as more consistent with the rest
> of
> > our code base.
> >
> > Either is fine by me as long as brokers keep the file and just do refresh
> > :)
> >
> > On Tue, May 12, 2015 at 2:54 AM, Joel Koshy  > > wrote:
> >
> > > So the general concern here is the dichotomy of configs (which we
> > > already have - i.e., in the form of broker config files vs topic
> > > configs in zookeeper). We (at LinkedIn) had some discussions on this
> > > last week and had this very question for the operations team whose
> > > opinion is I think to a large degree a touchstone for this decision:
> > > "Has the operations team at LinkedIn experienced any pain so far with
> > > managing topic configs in ZooKeeper (while broker configs are
> > > file-based)?" It turns out that ops overwhelmingly favors the current
> > > approach. i.e., service configs as file-based configs and client/topic
> > > configs in ZooKeeper is intuitive and works great. This may be
> > > somewhat counter-intuitive to devs, but this is one of those decisions
> > > for which ops input is very critical - because for all practical
> > > purposes, they are the users in this discussion.
> > >
> > > If we continue with this dichotomy and need to support dynamic config
> > > for client/topic configs as well as select service configs then there
> > > will need to be dichotomy in the config change mechanism as well.
> > > i.e., client/topic configs will change via (say) a ZooKeeper watch and
> > > the service config will change via a config file re-read (on SIGHUP)
> > > after config changes have been pushed out to local files. Is this a
> > > bad thing? Personally, I don't think it is - i.e. I'm in favor of this
> > > approach. What do others think?
> > >
> > > Thanks,
> > >
> > > Joel
> > >
> > > On Mon, May 11, 2015 at 11:08:44PM +0300, Gwen Shapira wrote:
> > > > What Todd said :)
> > > >
> > > > (I think my ops background is showing...)
> > > >
> > > > On Mon, May 11, 2015 at 10:17 PM, Todd Palino  > > wrote:
> > > >
> > > > > I understand your point here, Jay, but I disagree that we can't
> have
> > > two
> > > > > configuration systems. We have two different types of configuration
> > > > > information. We have configuration that relates to the service
> itself
> > > (the
> > > > > Kafka broker), and we have configuration that relates to the
> content
> > > within
> > > > > the service (topics). I would put the client configuration (quotas)
> > in
> > > the
> > > > > with the second part, as it is dynamic information. I just don't
> see
> > a
> > > good
> > > > > argument for effectively degrading the configuration for the
> service
> > > > > because of trying to keep it paired with the configuration of
> dynamic
> > > > > resources.
> > > > >
> > > > > -Todd
> > > > >
> > > > > On Mon, May 11, 2015 at 11:33 AM, Jay Kreps  > >
> > > wrote:
> > > > >
> > > > > > I totally agree that ZK is not in-and-of-itself a configuration
> > > > > management
> > > > > > solution and it would be better if we could just keep all our
> > config
> > > in
> > > > > > files. Anyone who has followed the various config discussions
> over
> > > the
> > > > > past
> > > > > > few years of discussion knows I'm the biggest proponent of
> > immutable
> > > > > > file-driven config.
> > > > > >
> > > > > > The analogy to "normal unix services" isn't actually quite right
> > > though.
> > > > > > The problem Kafka has is that a number of the configurable
> entities
> > > it
> > > > > > manages are added dynamically--topics, clients, consumer groups,
> > etc.
> > > > > What
> > > > > > this actually resembles is not a unix services like HTTPD but a
> > > database,
> > > > > > and databases typically do manage config dynamically for exactly
> > the
> > > same
> > > > > > reason.
> > > > > >
> > > > > > The last few emails are arguing that files > ZK as a config
> > > solution. I
> 

[jira] [Created] (KAFKA-2189) Snappy compression of message batches less efficient in 0.8.2.1

2015-05-12 Thread Olson,Andrew (JIRA)
Olson,Andrew created KAFKA-2189:
---

 Summary: Snappy compression of message batches less efficient in 
0.8.2.1
 Key: KAFKA-2189
 URL: https://issues.apache.org/jira/browse/KAFKA-2189
 Project: Kafka
  Issue Type: Bug
  Components: log
Affects Versions: 0.8.2.1
Reporter: Olson,Andrew
Assignee: Jay Kreps


We are using snappy compression and noticed a fairly substantial increase 
(about 2.25x) in log filesystem space consumption after upgrading a Kafka 
cluster from 0.8.1.1 to 0.8.2.1. We found that this is caused by messages being 
seemingly recompressed individually (or possibly with a much smaller buffer or 
dictionary?) instead of as a batch as sent by producers. We eventually tracked 
down the change in compression ratio/scope to this [1] commit that updated the 
snappy version from 1.0.5 to 1.1.1.3. The Kafka client version does not appear 
to be relevant as we can reproduce this with both the 0.8.1.1 and 0.8.2.1 
Producer.

Here are the log files from our troubleshooting that contain the same set of 
1000 messages, for batch sizes of 1, 10, 100, and 1000. f9d9b was the last 
commit with 0.8.1.1-like behavior prior to f5ab8 introducing the issue.

{noformat}
-rw-rw-r-- 1 kafka kafka 404967 May 12 11:45 
/var/kafka2/f9d9b-batch-1-0/.log
-rw-rw-r-- 1 kafka kafka 119951 May 12 11:45 
/var/kafka2/f9d9b-batch-10-0/.log
-rw-rw-r-- 1 kafka kafka  89645 May 12 11:45 
/var/kafka2/f9d9b-batch-100-0/.log
-rw-rw-r-- 1 kafka kafka  88279 May 12 11:45 
/var/kafka2/f9d9b-batch-1000-0/.log

-rw-rw-r-- 1 kafka kafka 402837 May 12 11:41 
/var/kafka2/f5ab8-batch-1-0/.log
-rw-rw-r-- 1 kafka kafka 382437 May 12 11:41 
/var/kafka2/f5ab8-batch-10-0/.log
-rw-rw-r-- 1 kafka kafka 364791 May 12 11:41 
/var/kafka2/f5ab8-batch-100-0/.log
-rw-rw-r-- 1 kafka kafka 380693 May 12 11:41 
/var/kafka2/f5ab8-batch-1000-0/.log
{noformat}

[1] 
https://github.com/apache/kafka/commit/f5ab8e1780cf80f267906e3259ad4f9278c32d28 
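
A small comparison using snappy-java directly (outside of Kafka; exact sizes 
depend on the payload) illustrates why compressing messages individually loses 
so much ratio compared to compressing the whole batch once:

{code}
import java.io.ByteArrayOutputStream;
import org.xerial.snappy.Snappy;

// Compress 1000 similar messages one-by-one vs. as a single batch. The batch
// form lets snappy exploit redundancy across messages, which is roughly the
// difference visible in the log sizes above.
public class SnappyBatchSketch {
    public static void main(String[] args) throws Exception {
        byte[][] messages = new byte[1000][];
        for (int i = 0; i < messages.length; i++)
            messages[i] = ("metric.host-42.cpu.idle value=" + i).getBytes("UTF-8");

        long individual = 0;
        ByteArrayOutputStream batch = new ByteArrayOutputStream();
        for (byte[] m : messages) {
            individual += Snappy.compress(m).length; // each message on its own
            batch.write(m);
        }
        long batched = Snappy.compress(batch.toByteArray()).length;

        System.out.println("sum of individually compressed sizes: " + individual);
        System.out.println("size of batch compressed once:        " + batched);
    }
}
{code}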



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1554) Corrupt index found on clean startup

2015-05-12 Thread Mayuresh Gharat (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540270#comment-14540270
 ] 

Mayuresh Gharat commented on KAFKA-1554:


This sounds interesting. We may try to reproduce this and check.

> Corrupt index found on clean startup
> 
>
> Key: KAFKA-1554
> URL: https://issues.apache.org/jira/browse/KAFKA-1554
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1
> Environment: ubuntu 12.04, oracle jdk 1.7
>Reporter: Alexis Midon
>Assignee: Mayuresh Gharat
>Priority: Critical
> Fix For: 0.9.0
>
> Attachments: KAFKA-1554.patch
>
>
> On a clean start up, corrupted index files are found.
> After investigations, it appears that some pre-allocated index files are not 
> "compacted" correctly and the end of the file is full of zeroes.
> As a result, on start up, the last relative offset is zero which yields an 
> offset equal to the base offset.
> The workaround is to delete all index files of size 10MB (the size of the 
> pre-allocated files), and restart. Index files will be re-created.
> {code}
> find $your_data_directory -size 10485760c -name *.index #-delete
> {code}
> This is issue might be related/similar to 
> https://issues.apache.org/jira/browse/KAFKA-1112
> {code}
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,696 
> INFO main kafka.server.KafkaServer.info - [Kafka Server 847605514], starting
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,698 
> INFO main kafka.server.KafkaServer.info - [Kafka Server 847605514], 
> Connecting to zookeeper on 
> zk-main0.XXX:2181,zk-main1.XXX:2181,zk-main2.:2181/production/kafka/main
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,708 
> INFO 
> ZkClient-EventThread-14-zk-main0.XXX.com:2181,zk-main1.XXX.com:2181,zk-main2.XXX.com:2181,zk-main3.XXX.com:2181,zk-main4.XXX.com:2181/production/kafka/main
>  org.I0Itec.zkclient.ZkEventThread.run - Starting ZkClient event thread.
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:host.name=i-6b948138.inst.aws.airbnb.com
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.version=1.7.0_55
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.vendor=Oracle Corporation
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.home=/usr/lib/jvm/jre-7-oracle-x64/jre
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.class.path=libs/snappy-java-1.0.5.jar:libs/scala-library-2.10.1.jar:libs/slf4j-api-1.7.2.jar:libs/jopt-simple-3.2.jar:libs/metrics-annotation-2.2.0.jar:libs/log4j-1.2.15.jar:libs/kafka_2.10-0.8.1.jar:libs/zkclient-0.3.jar:libs/zookeeper-3.3.4.jar:libs/metrics-core-2.2.0.jar
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.io.tmpdir=/tmp
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.compiler=
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:os.name=Linux
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:os.arch=amd64
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,717 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:os.version=3.2.0-61-virtual
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,717 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:user.name=kafka
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,717 
> INFO main org.apache.zookeeper.Zo

Re: Review Request 34050: Patch for KAFKA-2169

2015-05-12 Thread Parth Brahmbhatt

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34050/#review83427
---



core/src/main/scala/kafka/controller/KafkaController.scala


I don't understand why this needs to be done, which is why I haven't 
addressed it. Can you elaborate on why it would matter which of the two calls 
exits the process?


- Parth Brahmbhatt


On May 11, 2015, 8:53 p.m., Parth Brahmbhatt wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34050/
> ---
> 
> (Updated May 11, 2015, 8:53 p.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-2169
> https://issues.apache.org/jira/browse/KAFKA-2169
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
Use System.exit instead of throwing a RuntimeException when ZooKeeper session 
establishment fails.
> 
> 
> Removing the unnecessary @throws.
> 
> 
> Diffs
> -
> 
>   build.gradle fef515b3b2276b1f861e7cc2e33e74c3ce5e405b 
>   core/src/main/scala/kafka/consumer/ZookeeperConsumerConnector.scala 
> aa8d9404a3e78a365df06404b79d0d8f694b4bd6 
>   core/src/main/scala/kafka/consumer/ZookeeperTopicEventWatcher.scala 
> 38f4ec0bd1b388cc8fc04b38bbb2e7aaa1c3f43b 
>   core/src/main/scala/kafka/controller/KafkaController.scala 
> a6351163f5b6f080d6fa50bcc3533d445fcbc067 
>   core/src/main/scala/kafka/server/KafkaHealthcheck.scala 
> 861b7f644941f88ce04a4e95f6b28d18bf1db16d 
> 
> Diff: https://reviews.apache.org/r/34050/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Parth Brahmbhatt
> 
>



[jira] [Commented] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Evan Huus (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540218#comment-14540218
 ] 

Evan Huus commented on KAFKA-2147:
--

1. 4-8k messages/s snappy-compressed, taking 2.5-5 MB/s bandwidth. Now that we 
have several days of data, I can say that the problem appears much less severe 
(though still present) during off-peak times. The average purgatory size 
definitely rises and falls together with the producer throughput.

It is probably worth noting that the majority of our producer's traffic 
(slightly over 90%) is on just two topics, while the remaining topics tend to 
see *much* lower volume (~10 messages/minute in some cases).

2. Yes. As I wrote above: "Correlates pretty strongly with the time since the 
previous purge". When the purgatory was purged frequently, the purges were 
small. The first purge after each ~30-second gap was much larger. 
NumDelayedRequests hovers between 200-250 regardless of the purgatory size.

3. We do not have GC log enabled, however: we have tried both the default and 
G1 collector with no difference. We are currently running G1, and the stats it 
exposes over JMX look sane.

> Unbalanced replication can cause extreme purgatory growth
> -
>
> Key: KAFKA-2147
> URL: https://issues.apache.org/jira/browse/KAFKA-2147
> Project: Kafka
>  Issue Type: Bug
>  Components: purgatory, replication
>Affects Versions: 0.8.2.1
>Reporter: Evan Huus
>Assignee: Joel Koshy
>
> Apologies in advance, this is going to be a bit of complex description, 
> mainly because we've seen this issue several different ways and we're still 
> tying them together in terms of root cause and analysis.
> It is worth noting now that we have all our producers set up to send 
> RequiredAcks==-1, and that this includes all our MirrorMakers.
> I understand the purgatory is being rewritten (again) for 0.8.3. Hopefully 
> that will incidentally fix this issue, or at least render it moot.
> h4. Symptoms
> Fetch request purgatory on a broker or brokers grows rapidly and steadily at 
> a rate of roughly 1-5K requests per second. Heap memory used also grows to 
> keep pace. When 4-5 million requests have accumulated in purgatory, the 
> purgatory is drained, causing a substantial latency spike. The node will tend 
> to drop leadership, replicate, and recover.
> h5. Case 1 - MirrorMaker
> We first noticed this case when enabling mirrormaker. We had one primary 
> cluster already, with many producers and consumers. We created a second, 
> identical cluster and enabled replication from the original to the new 
> cluster on some topics using mirrormaker. This caused all six nodes in the 
> new cluster to exhibit the symptom in lockstep - their purgatories would all 
> grow together, and get drained within about 20 seconds of each other. The 
> cluster-wide latency spikes at this time caused several problems for us.
> Turning MM on and off turned the problem on and off very precisely. When we 
> stopped MM, the purgatories would all drop to normal levels immediately, and 
> would start climbing again when we restarted it.
> Note that this is the *fetch* purgatories on the brokers that MM was 
> *producing* to, which indicates fairly strongly that this is a replication 
> issue, not a MM issue.
> This particular cluster and MM setup was abandoned for other reasons before 
> we could make much progress debugging.
> h5. Case 2 - Broker 6
> The second time we saw this issue was on the newest broker (broker 6) in the 
> original cluster. For a long time we were running with five nodes, and 
> eventually added a sixth to handle the increased load. At first, we moved 
> only a handful of higher-volume partitions to this broker. Later, we created 
> a group of new topics (totalling around 100 partitions) for testing purposes 
> that were spread automatically across all six nodes. These topics saw 
> occasional traffic, but were generally unused. At this point broker 6 had 
> leadership for about an equal number of high-volume and unused partitions, 
> about 15-20 of each.
> Around this time (we don't have detailed enough data to prove real 
> correlation unfortunately), the issue started appearing on this broker as 
> well, but not on any of the other brokers in the cluster.
> h4. Debugging
> The first thing we tried was to reduce the 
> `fetch.purgatory.purge.interval.requests` from the default of 1000 to a much 
> lower value of 200. This had no noticeable effect at all.
> We then enabled debug logging on broker06 and started looking through that. I 
> can attach complete log samples if necessary, but the thing that stood out 
> for us was a substantial number of the following lines:
> {noformat}
> [2015-04-23 20:05:15,196] DEBUG [KafkaApi-6] Putting fetch request with 

[jira] [Commented] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540168#comment-14540168
 ] 

Jun Rao commented on KAFKA-2147:


A few other questions.
1. What's the produce request rate? We only add a replica fetch request to 
the purgatory if there are no new bytes across all fetched partitions (see 
the sketch after these questions). So, for the purgatory to build up, the 
producer has to send data at a rate that's not frequent enough to prevent the 
replica fetch requests from being added to the purgatory, yet frequent enough 
that the fetch requests can be satisfied and new fetch requests issued.

2. Does more frequent purging correlate with purging more elements? Being 
able to purge 210073 requests at once is a bit suspicious. It either means 
that we are not purging frequently enough or that the purging is slow. What's 
the observed NumDelayedRequests?

3. Do you have GC log enabled? How frequent and how long is the gc?
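
To clarify what (1) refers to, a simplified sketch of the condition 
(illustrative names only, not the actual KafkaApis/DelayedFetch code):

{code}
import java.util.Arrays;
import java.util.List;

// A replica fetch request is parked in the fetch purgatory only if *none* of
// the fetched partitions returned new bytes; otherwise it is answered right away.
public class FetchPurgatorySketch {
    static class PartitionRead {
        final String topicPartition;
        final int bytesRead;
        PartitionRead(String tp, int bytes) { this.topicPartition = tp; this.bytesRead = bytes; }
    }

    static boolean shouldGoToPurgatory(List<PartitionRead> reads) {
        long totalNewBytes = 0;
        for (PartitionRead r : reads)
            totalNewBytes += r.bytesRead;
        return totalNewBytes == 0; // "no new bytes across all fetched partitions"
    }

    public static void main(String[] args) {
        List<PartitionRead> quiet = Arrays.asList(
                new PartitionRead("busy-topic-0", 0), new PartitionRead("idle-topic-3", 0));
        List<PartitionRead> active = Arrays.asList(
                new PartitionRead("busy-topic-0", 65536), new PartitionRead("idle-topic-3", 0));
        System.out.println(shouldGoToPurgatory(quiet));  // true  -> parked in purgatory
        System.out.println(shouldGoToPurgatory(active)); // false -> answered immediately
    }
}
{code}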

> Unbalanced replication can cause extreme purgatory growth
> -
>
> Key: KAFKA-2147
> URL: https://issues.apache.org/jira/browse/KAFKA-2147
> Project: Kafka
>  Issue Type: Bug
>  Components: purgatory, replication
>Affects Versions: 0.8.2.1
>Reporter: Evan Huus
>Assignee: Joel Koshy
>
> Apologies in advance, this is going to be a bit of complex description, 
> mainly because we've seen this issue several different ways and we're still 
> tying them together in terms of root cause and analysis.
> It is worth noting now that we have all our producers set up to send 
> RequiredAcks==-1, and that this includes all our MirrorMakers.
> I understand the purgatory is being rewritten (again) for 0.8.3. Hopefully 
> that will incidentally fix this issue, or at least render it moot.
> h4. Symptoms
> Fetch request purgatory on a broker or brokers grows rapidly and steadily at 
> a rate of roughly 1-5K requests per second. Heap memory used also grows to 
> keep pace. When 4-5 million requests have accumulated in purgatory, the 
> purgatory is drained, causing a substantial latency spike. The node will tend 
> to drop leadership, replicate, and recover.
> h5. Case 1 - MirrorMaker
> We first noticed this case when enabling mirrormaker. We had one primary 
> cluster already, with many producers and consumers. We created a second, 
> identical cluster and enabled replication from the original to the new 
> cluster on some topics using mirrormaker. This caused all six nodes in the 
> new cluster to exhibit the symptom in lockstep - their purgatories would all 
> grow together, and get drained within about 20 seconds of each other. The 
> cluster-wide latency spikes at this time caused several problems for us.
> Turning MM on and off turned the problem on and off very precisely. When we 
> stopped MM, the purgatories would all drop to normal levels immediately, and 
> would start climbing again when we restarted it.
> Note that this is the *fetch* purgatories on the brokers that MM was 
> *producing* to, which indicates fairly strongly that this is a replication 
> issue, not a MM issue.
> This particular cluster and MM setup was abandoned for other reasons before 
> we could make much progress debugging.
> h5. Case 2 - Broker 6
> The second time we saw this issue was on the newest broker (broker 6) in the 
> original cluster. For a long time we were running with five nodes, and 
> eventually added a sixth to handle the increased load. At first, we moved 
> only a handful of higher-volume partitions to this broker. Later, we created 
> a group of new topics (totalling around 100 partitions) for testing purposes 
> that were spread automatically across all six nodes. These topics saw 
> occasional traffic, but were generally unused. At this point broker 6 had 
> leadership for about an equal number of high-volume and unused partitions, 
> about 15-20 of each.
> Around this time (we don't have detailed enough data to prove real 
> correlation unfortunately), the issue started appearing on this broker as 
> well, but not on any of the other brokers in the cluster.
> h4. Debugging
> The first thing we tried was to reduce the 
> `fetch.purgatory.purge.interval.requests` from the default of 1000 to a much 
> lower value of 200. This had no noticeable effect at all.
> We then enabled debug logging on broker06 and started looking through that. I 
> can attach complete log samples if necessary, but the thing that stood out 
> for us was a substantial number of the following lines:
> {noformat}
> [2015-04-23 20:05:15,196] DEBUG [KafkaApi-6] Putting fetch request with 
> correlation id 49939 from client ReplicaFetcherThread-0-6 into purgatory 
> (kafka.server.KafkaApis)
> {noformat}
> The volume of these lines seemed to match (approximately) the fetch purgatory 
> growth on that broker.
> At this point we develope

[jira] [Comment Edited] (KAFKA-1554) Corrupt index found on clean startup

2015-05-12 Thread Steve Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539985#comment-14539985
 ] 

Steve Miller edited comment on KAFKA-1554 at 5/12/15 3:09 PM:
--

Hm, that seems to have rolled out of the logs at this point -- sorry.  I don't 
think that specific error was an issue: I think that anything that makes Kafka 
exit after it's created the new segment index but before it's compacted it is 
an issue.

I think that if you do the following, you can reproduce the issue:

* create a new, empty topic, and don't publish anything to it.  For each 
partition in that topic, you should end up with the usual topicname-partition# 
directory somewhere (e.g., 'junk-2'), which will have an empty log file and an 
index file of size 10485760.
* copy the whole junk-2 directory somewhere safe (e.g., /var/tmp/junk-2).
* stop kafka
* cp -pr /var/tmp/junk-2 /whatever/kafka/log/junk-2.old
* start kafka

When you do that, Kafka will fail to start properly, with an error like:

{code}
[2015-05-12 15:00:55,661] FATAL [Kafka Server 1], Fatal error during 
KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.lang.NumberFormatException: For input string: "6.old"
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:247)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:30)
at kafka.log.Log$.parseTopicPartitionName(Log.scala:833)
at 
kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$3$$anonfun$apply$7$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:138)
at kafka.utils.Utils$$anon$1.run(Utils.scala:54)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

and you'll also now have many index files of size 10485760.

Note that after a clean shutdown, you don't have any index files of that size, 
so I think what's happening here is something like:

* the code runs that replaces the working index file in any log directory with 
the 10485760-sized files, which Kafka uses (presumably) for the current log 
segment
* something blows up
* the usual code that would run at clean execution to replace the current, 
10485760-sized files with their compacted versions doesn't run because of the 
blowup

leaving these files on disk.  And then at startup Kafka panics because it 
considers them corrupted.

I was able to use these steps on a spare broker we had and I could get this to 
repeat at will. :)
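
To make the failure mode concrete, here is a minimal Scala sketch (not the actual kafka.log.Log.parseTopicPartitionName code) of why a renamed directory trips the startup parse in the stack trace above: the partition id is whatever follows the last '-' in the directory name, parsed as an Int, so "6.old" throws.

{code}
// Sketch only, assuming the partition id is parsed from the text after the
// last '-' in the log directory name.
object ParsePartitionSketch extends App {
  def partitionId(dirName: String): Int =
    dirName.substring(dirName.lastIndexOf('-') + 1).toInt

  println(partitionId("junk-6"))      // prints 6
  println(partitionId("junk-6.old"))  // throws NumberFormatException: For input string: "6.old"
}
{code}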


was (Author: stevemil00):
Hm, that seems to have rolled out of the logs at this point -- sorry.  I don't 
think that specific error was an issue: I think that anything that makes Kafka 
exit after it's created the new segment index but before it's compacted it is 
an issue.

I think that if you do the following, you can reproduce the issue:

* create a new, empty topic, and don't publish anything to it.  For each 
partition in that topic, you should end up with the usual topicname-partition# 
directory somewhere (e.g., 'junk-2'), which will have an empty log file and an 
index file of size 10485760.
* copy the whole junk-2 directory somewhere safe (e.g., /var/tmp/junk-2).
* stop kafka
* cp -pr /var/tmp/junk-2 /whatever/kafka/log/junk-2.old
* start kafka

When you do that, Kafka will fail to start properly, with an error like:

{code}
[2015-05-12 15:00:55,661] FATAL [Kafka Server 1], Fatal error during 
KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.lang.NumberFormatException: For input string: "6.old"
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:247)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:30)
at kafka.log.Log$.parseTopicPartitionName(Log.scala:833)
at 
kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$3$$anonfun$apply$7$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:138)
at kafka.utils.Utils$$anon$1.run(Utils.scala:54)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.conc

[jira] [Commented] (KAFKA-1554) Corrupt index found on clean startup

2015-05-12 Thread Steve Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539985#comment-14539985
 ] 

Steve Miller commented on KAFKA-1554:
-

Hm, that seems to have rolled out of the logs at this point -- sorry.  I don't 
think that specific error was an issue: I think that anything that makes Kafka 
exit after it's created the new segment index but before it's compacted it is 
an issue.

I think that if you do the following, you can reproduce the issue:

* create a new, empty topic, and don't publish anything to it.  For each 
partition in that topic, you should end up with the usual topicname-partition# 
directory somewhere (e.g., 'junk-2'), which will have an empty log file and an 
index file of size 10485760.
* copy the whole junk-2 directory somewhere safe (e.g., /var/tmp/junk-2).
* stop kafka
* cp -pr /var/tmp/junk-2 /whatever/kafka/log/junk-2.old
* start kafka

When you do that, Kafka will fail to start properly, with an error like:

{code}
[2015-05-12 15:00:55,661] FATAL [Kafka Server 1], Fatal error during 
KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.lang.NumberFormatException: For input string: "6.old"
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:247)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:30)
at kafka.log.Log$.parseTopicPartitionName(Log.scala:833)
at 
kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$3$$anonfun$apply$7$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:138)
at kafka.utils.Utils$$anon$1.run(Utils.scala:54)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

and you'll also now have many index files of size 10485760.

Note that after a clean shutdown, you don't have any index files of that size, 
so I think what's happening here is something like:

* the code runs that replaces the working index file in any log directory with 
the 10485760-sized files, which Kafka uses (presumably) for the current log 
segment
* something blows up
* the usual code that would run at clean execution to replace the current, 
10485760-sized files with their compacted versions doesn't run because of the 
blowup

leaving these files on disk.  And then at startup Kafka panics because it 
considers them corrupted.

I was able to use these steps on a spare broker we had and I could get this to 
repeat at will. :)

> Corrupt index found on clean startup
> 
>
> Key: KAFKA-1554
> URL: https://issues.apache.org/jira/browse/KAFKA-1554
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1
> Environment: ubuntu 12.04, oracle jdk 1.7
>Reporter: Alexis Midon
>Assignee: Mayuresh Gharat
>Priority: Critical
> Fix For: 0.9.0
>
> Attachments: KAFKA-1554.patch
>
>
> On a clean start up, corrupted index files are found.
> After investigations, it appears that some pre-allocated index files are not 
> "compacted" correctly and the end of the file is full of zeroes.
> As a result, on start up, the last relative offset is zero which yields an 
> offset equal to the base offset.
> The workaround is to delete all index files of size 10MB (the size of the 
> pre-allocated files), and restart. Index files will be re-created.
> {code}
> find $your_data_directory -size 10485760c -name *.index #-delete
> {code}
> This issue might be related/similar to 
> https://issues.apache.org/jira/browse/KAFKA-1112
> {code}
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,696 
> INFO main kafka.server.KafkaServer.info - [Kafka Server 847605514], starting
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,698 
> INFO main kafka.server.KafkaServer.info - [Kafka Server 847605514], 
> Connecting to zookeeper on 
> zk-main0.XXX:2181,zk-main1.XXX:2181,zk-main2.:2181/production/kafka/main
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,708 
> INFO 
> ZkClient-EventThread-14-zk-main0.XXX.com:2181,zk-main1.XXX.com:2181,zk-main2.XXX.com:2181,zk-main3.XXX.com:2181,zk-main4.XXX.com:2181/production/kafka/main
>  org.I0Itec.zkclient.ZkEventThread.run - Starting ZkClient event thread.
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-

Re: Review Request 33557: Patch for KAFKA-1936

2015-05-12 Thread Dong Lin


> On May 12, 2015, 2:45 p.m., Joel Koshy wrote:
> > core/src/main/scala/kafka/server/KafkaRequestHandler.scala, line 147
> > 
> >
> > Just a minor naming comment: these `mark*Rate` methods can just be 
> > called `record*`
> > 
> > For e.g., `recordMessagesIn`

I see. Previously I wanted to be more verbose and name the functions based on 
what they do (e.g. *Rate.mark(..)).

Could you look at the updated patch? I have renamed these functions. Thank you.


- Dong


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33557/#review83398
---


On May 12, 2015, 3:02 p.m., Dong Lin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33557/
> ---
> 
> (Updated May 12, 2015, 3:02 p.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-1936
> https://issues.apache.org/jira/browse/KAFKA-1936
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> KAFKA-1936; Track offset commit requests separately from produce requests
> 
> 
> Diffs
> -
> 
>   core/src/main/scala/kafka/log/Log.scala 
> 84e7b8fe9dd014884b60c4fbe13c835cf02a40e4 
>   core/src/main/scala/kafka/server/KafkaApis.scala 
> 417960dd1ab407ebebad8fdb0e97415db3e91a2f 
>   core/src/main/scala/kafka/server/KafkaRequestHandler.scala 
> a1558afed20bc651ca442a774920d782890167a5 
>   core/src/main/scala/kafka/server/ReplicaManager.scala 
> 59c9bc3ac3a8afc07a6f8c88c5871304db588d17 
>   core/src/test/scala/unit/kafka/server/OffsetCommitTest.scala 
> 528525b719ec916e16f8b3ae3715bec4b5dcc47d 
> 
> Diff: https://reviews.apache.org/r/33557/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Dong Lin
> 
>



[jira] [Updated] (KAFKA-1936) Track offset commit requests separately from produce requests

2015-05-12 Thread Dong Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-1936:

Attachment: KAFKA-1936_2015-05-12_08:02:32.patch

> Track offset commit requests separately from produce requests
> -
>
> Key: KAFKA-1936
> URL: https://issues.apache.org/jira/browse/KAFKA-1936
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Aditya Auradkar
>Assignee: Dong Lin
> Attachments: KAFKA-1936.patch, KAFKA-1936_2015-04-26_21:20:28.patch, 
> KAFKA-1936_2015-04-26_21:26:34.patch, KAFKA-1936_2015-05-04_15:17:50.patch, 
> KAFKA-1936_2015-05-12_07:32:07.patch, KAFKA-1936_2015-05-12_08:02:32.patch
>
>
> In ReplicaManager, failed and total produce requests are updated from 
> appendToLocalLog. Since offset commit requests also follow the same path, 
> they are counted along with produce requests. Add a metric and count them 
> separately.
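
As a rough illustration of that idea (hypothetical names, not the committed patch): the append path could take a flag saying whether the append came from an offset commit and mark a dedicated meter instead of the produce-request meters.

{code}
import com.yammer.metrics.Metrics
import java.util.concurrent.TimeUnit

// Hypothetical sketch only: offset commits go through the same append path as
// regular produce requests, so pass a flag down and count them separately.
object AppendMetricsSketch extends App {
  val totalProduceRequestRate =
    Metrics.newMeter(getClass, "TotalProduceRequestsPerSec", "requests", TimeUnit.SECONDS)
  val offsetCommitRate =
    Metrics.newMeter(getClass, "OffsetCommitsPerSec", "requests", TimeUnit.SECONDS)

  def appendToLocalLog(isOffsetCommit: Boolean /*, messages, ... */): Unit = {
    if (isOffsetCommit) offsetCommitRate.mark() else totalProduceRequestRate.mark()
    // ... append the messages to the local log ...
  }

  appendToLocalLog(isOffsetCommit = false) // a regular produce request
  appendToLocalLog(isOffsetCommit = true)  // an offset commit
}
{code}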



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1936) Track offset commit requests separately from produce requests

2015-05-12 Thread Dong Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539974#comment-14539974
 ] 

Dong Lin commented on KAFKA-1936:
-

Updated reviewboard https://reviews.apache.org/r/33557/diff/
 against branch origin/trunk

> Track offset commit requests separately from produce requests
> -
>
> Key: KAFKA-1936
> URL: https://issues.apache.org/jira/browse/KAFKA-1936
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Aditya Auradkar
>Assignee: Dong Lin
> Attachments: KAFKA-1936.patch, KAFKA-1936_2015-04-26_21:20:28.patch, 
> KAFKA-1936_2015-04-26_21:26:34.patch, KAFKA-1936_2015-05-04_15:17:50.patch, 
> KAFKA-1936_2015-05-12_07:32:07.patch, KAFKA-1936_2015-05-12_08:02:32.patch
>
>
> In ReplicaManager, failed and total produce requests are updated from 
> appendToLocalLog. Since offset commit requests also follow the same path, 
> they are counted along with produce requests. Add a metric and count them 
> separately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 33557: Patch for KAFKA-1936

2015-05-12 Thread Dong Lin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33557/
---

(Updated May 12, 2015, 3:02 p.m.)


Review request for kafka.


Bugs: KAFKA-1936
https://issues.apache.org/jira/browse/KAFKA-1936


Repository: kafka


Description
---

KAFKA-1936; Track offset commit requests separately from produce requests


Diffs (updated)
-

  core/src/main/scala/kafka/log/Log.scala 
84e7b8fe9dd014884b60c4fbe13c835cf02a40e4 
  core/src/main/scala/kafka/server/KafkaApis.scala 
417960dd1ab407ebebad8fdb0e97415db3e91a2f 
  core/src/main/scala/kafka/server/KafkaRequestHandler.scala 
a1558afed20bc651ca442a774920d782890167a5 
  core/src/main/scala/kafka/server/ReplicaManager.scala 
59c9bc3ac3a8afc07a6f8c88c5871304db588d17 
  core/src/test/scala/unit/kafka/server/OffsetCommitTest.scala 
528525b719ec916e16f8b3ae3715bec4b5dcc47d 

Diff: https://reviews.apache.org/r/33557/diff/


Testing
---


Thanks,

Dong Lin



Re: Review Request 33557: Patch for KAFKA-1936

2015-05-12 Thread Joel Koshy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33557/#review83398
---

Ship it!



core/src/main/scala/kafka/server/KafkaRequestHandler.scala


Just a minor naming comment: these `mark*Rate` methods can just be called 
`record*`

For e.g., `recordMessagesIn`


- Joel Koshy


On May 12, 2015, 2:32 p.m., Dong Lin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33557/
> ---
> 
> (Updated May 12, 2015, 2:32 p.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-1936
> https://issues.apache.org/jira/browse/KAFKA-1936
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> KAFKA-1936; Track offset commit requests separately from produce requests
> 
> 
> Diffs
> -
> 
>   core/src/main/scala/kafka/log/Log.scala 
> 84e7b8fe9dd014884b60c4fbe13c835cf02a40e4 
>   core/src/main/scala/kafka/server/KafkaApis.scala 
> 417960dd1ab407ebebad8fdb0e97415db3e91a2f 
>   core/src/main/scala/kafka/server/KafkaRequestHandler.scala 
> a1558afed20bc651ca442a774920d782890167a5 
>   core/src/main/scala/kafka/server/ReplicaManager.scala 
> 59c9bc3ac3a8afc07a6f8c88c5871304db588d17 
>   core/src/test/scala/unit/kafka/server/OffsetCommitTest.scala 
> 528525b719ec916e16f8b3ae3715bec4b5dcc47d 
> 
> Diff: https://reviews.apache.org/r/33557/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Dong Lin
> 
>



Re: Review Request 33557: Patch for KAFKA-1936

2015-05-12 Thread Dong Lin


> On May 11, 2015, 5:25 p.m., Joel Koshy wrote:
> > core/src/main/scala/kafka/server/KafkaRequestHandler.scala, line 147
> > 
> >
> > How about
> > markBrokerTopicMeters
> > or
> > updateBrokerTopicStats
> > 
> > I think an even clearer approach would be to have explicit methods:
> > 
> > ```
> > messagesIn(n)
> > bytesIn(n)
> > bytesOut(n)
> > bytesRejected(n)
> > ```
> > and so on.
> > 
> > The current code assumes everything is a meter (which it is) but the 
> > above may be clearer and makes fewer assumptions about the underlying 
> > metric types.
> > 
> > It may also eliminate the need for the enumeration.
> > 
> > What do you think?

Joel: Thanks much for your suggestions.

I have uploaded a patch that uses the explicit methods (e.g. 
markMessageIn(topic, n)) for each of the eight meters in BrokerTopicMetrics, and 
removed the enumeration. 

Overall the code in KafkaRequestHandler is a bit longer, but it avoids the need 
for an extra enumeration class and makes fewer assumptions about the underlying 
metric. I think this approach is better.
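
To make that concrete, a minimal sketch (illustrative names only, not the actual patch) of the explicit per-meter style: the caller states what happened, and the stats object marks both the per-topic meter and the all-topics meter, with no enumeration keying into a generic meter map.

{code}
import com.yammer.metrics.Metrics
import com.yammer.metrics.core.Meter
import java.util.concurrent.TimeUnit
import scala.collection.mutable

// Illustrative sketch only, not the committed BrokerTopicMetrics code.
object BrokerTopicStatsSketch {
  private val perTopicMessagesIn = mutable.Map.empty[String, Meter]
  private val allTopicsMessagesIn =
    Metrics.newMeter(getClass, "AllTopicsMessagesInPerSec", "messages", TimeUnit.SECONDS)

  private def topicMeter(topic: String): Meter =
    perTopicMessagesIn.getOrElseUpdate(
      topic,
      Metrics.newMeter(getClass, topic + "-MessagesInPerSec", "messages", TimeUnit.SECONDS))

  // One explicit method per meter; recordBytesIn, recordBytesOut, etc. would
  // follow the same pattern.
  def recordMessagesIn(topic: String, n: Long): Unit = {
    topicMeter(topic).mark(n)
    allTopicsMessagesIn.mark(n)
  }
}

// Usage from the request handling path:
// BrokerTopicStatsSketch.recordMessagesIn("my-topic", numAppendedMessages)
{code}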

Thanks again.


- Dong


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33557/#review83234
---


On May 12, 2015, 2:32 p.m., Dong Lin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33557/
> ---
> 
> (Updated May 12, 2015, 2:32 p.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-1936
> https://issues.apache.org/jira/browse/KAFKA-1936
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> KAFKA-1936; Track offset commit requests separately from produce requests
> 
> 
> Diffs
> -
> 
>   core/src/main/scala/kafka/log/Log.scala 
> 84e7b8fe9dd014884b60c4fbe13c835cf02a40e4 
>   core/src/main/scala/kafka/server/KafkaApis.scala 
> 417960dd1ab407ebebad8fdb0e97415db3e91a2f 
>   core/src/main/scala/kafka/server/KafkaRequestHandler.scala 
> a1558afed20bc651ca442a774920d782890167a5 
>   core/src/main/scala/kafka/server/ReplicaManager.scala 
> 59c9bc3ac3a8afc07a6f8c88c5871304db588d17 
>   core/src/test/scala/unit/kafka/server/OffsetCommitTest.scala 
> 528525b719ec916e16f8b3ae3715bec4b5dcc47d 
> 
> Diff: https://reviews.apache.org/r/33557/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Dong Lin
> 
>



[jira] [Commented] (KAFKA-1936) Track offset commit requests separately from produce requests

2015-05-12 Thread Dong Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539906#comment-14539906
 ] 

Dong Lin commented on KAFKA-1936:
-

Updated reviewboard https://reviews.apache.org/r/33557/diff/
 against branch origin/trunk

> Track offset commit requests separately from produce requests
> -
>
> Key: KAFKA-1936
> URL: https://issues.apache.org/jira/browse/KAFKA-1936
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Aditya Auradkar
>Assignee: Dong Lin
> Attachments: KAFKA-1936.patch, KAFKA-1936_2015-04-26_21:20:28.patch, 
> KAFKA-1936_2015-04-26_21:26:34.patch, KAFKA-1936_2015-05-04_15:17:50.patch, 
> KAFKA-1936_2015-05-12_07:32:07.patch
>
>
> In ReplicaManager, failed and total produce requests are updated from 
> appendToLocalLog. Since offset commit requests also follow the same path, 
> they are counted along with produce requests. Add a metric and count them 
> separately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-1936) Track offset commit requests separately from produce requests

2015-05-12 Thread Dong Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-1936:

Attachment: KAFKA-1936_2015-05-12_07:32:07.patch

> Track offset commit requests separately from produce requests
> -
>
> Key: KAFKA-1936
> URL: https://issues.apache.org/jira/browse/KAFKA-1936
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Aditya Auradkar
>Assignee: Dong Lin
> Attachments: KAFKA-1936.patch, KAFKA-1936_2015-04-26_21:20:28.patch, 
> KAFKA-1936_2015-04-26_21:26:34.patch, KAFKA-1936_2015-05-04_15:17:50.patch, 
> KAFKA-1936_2015-05-12_07:32:07.patch
>
>
> In ReplicaManager, failed and total produce requests are updated from 
> appendToLocalLog. Since offset commit requests also follow the same path, 
> they are counted along with produce requests. Add a metric and count them 
> separately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 33557: Patch for KAFKA-1936

2015-05-12 Thread Dong Lin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33557/
---

(Updated May 12, 2015, 2:32 p.m.)


Review request for kafka.


Bugs: KAFKA-1936
https://issues.apache.org/jira/browse/KAFKA-1936


Repository: kafka


Description (updated)
---

KAFKA-1936; Track offset commit requests separately from produce requests


Diffs (updated)
-

  core/src/main/scala/kafka/log/Log.scala 
84e7b8fe9dd014884b60c4fbe13c835cf02a40e4 
  core/src/main/scala/kafka/server/KafkaApis.scala 
417960dd1ab407ebebad8fdb0e97415db3e91a2f 
  core/src/main/scala/kafka/server/KafkaRequestHandler.scala 
a1558afed20bc651ca442a774920d782890167a5 
  core/src/main/scala/kafka/server/ReplicaManager.scala 
59c9bc3ac3a8afc07a6f8c88c5871304db588d17 
  core/src/test/scala/unit/kafka/server/OffsetCommitTest.scala 
528525b719ec916e16f8b3ae3715bec4b5dcc47d 

Diff: https://reviews.apache.org/r/33557/diff/


Testing
---


Thanks,

Dong Lin



[jira] [Commented] (KAFKA-1554) Corrupt index found on clean startup

2015-05-12 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539883#comment-14539883
 ] 

Joel Koshy commented on KAFKA-1554:
---

That is useful to know - it might help reproducing this. Do you by any chance 
have stack traces of the ClassNotFound or equivalent?
Yes, we create new segment indexes with that size but compact them when the 
segment rolls.
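
A rough sketch of the lifecycle being described (details assumed, not the actual OffsetIndex code): the active segment's index file is preallocated to the full 10485760 bytes and only trimmed back to its used size when the segment rolls or the broker shuts down cleanly, so a crash in between leaves the 10MB files that the workaround deletes.

{code}
import java.io.RandomAccessFile

// Sketch only; the real code memory-maps the index, but the size behaviour is
// the same: preallocate up front, shrink on roll or clean shutdown.
def preallocateIndex(path: String, maxBytes: Long = 10485760L): Unit = {
  val raf = new RandomAccessFile(path, "rw")
  try raf.setLength(maxBytes) finally raf.close()
}

def trimToValidSize(path: String, usedBytes: Long): Unit = {
  val raf = new RandomAccessFile(path, "rw")
  try raf.setLength(usedBytes) finally raf.close()
}
{code}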


> Corrupt index found on clean startup
> 
>
> Key: KAFKA-1554
> URL: https://issues.apache.org/jira/browse/KAFKA-1554
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1
> Environment: ubuntu 12.04, oracle jdk 1.7
>Reporter: Alexis Midon
>Assignee: Mayuresh Gharat
>Priority: Critical
> Fix For: 0.9.0
>
> Attachments: KAFKA-1554.patch
>
>
> On a clean start up, corrupted index files are found.
> After investigations, it appears that some pre-allocated index files are not 
> "compacted" correctly and the end of the file is full of zeroes.
> As a result, on start up, the last relative offset is zero which yields an 
> offset equal to the base offset.
> The workaround is to delete all index files of size 10MB (the size of the 
> pre-allocated files), and restart. Index files will be re-created.
> {code}
> find $your_data_directory -size 10485760c -name *.index #-delete
> {code}
> This issue might be related/similar to 
> https://issues.apache.org/jira/browse/KAFKA-1112
> {code}
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,696 
> INFO main kafka.server.KafkaServer.info - [Kafka Server 847605514], starting
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,698 
> INFO main kafka.server.KafkaServer.info - [Kafka Server 847605514], 
> Connecting to zookeeper on 
> zk-main0.XXX:2181,zk-main1.XXX:2181,zk-main2.:2181/production/kafka/main
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,708 
> INFO 
> ZkClient-EventThread-14-zk-main0.XXX.com:2181,zk-main1.XXX.com:2181,zk-main2.XXX.com:2181,zk-main3.XXX.com:2181,zk-main4.XXX.com:2181/production/kafka/main
>  org.I0Itec.zkclient.ZkEventThread.run - Starting ZkClient event thread.
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:host.name=i-6b948138.inst.aws.airbnb.com
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.version=1.7.0_55
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.vendor=Oracle Corporation
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.home=/usr/lib/jvm/jre-7-oracle-x64/jre
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.class.path=libs/snappy-java-1.0.5.jar:libs/scala-library-2.10.1.jar:libs/slf4j-api-1.7.2.jar:libs/jopt-simple-3.2.jar:libs/metrics-annotation-2.2.0.jar:libs/log4j-1.2.15.jar:libs/kafka_2.10-0.8.1.jar:libs/zkclient-0.3.jar:libs/zookeeper-3.3.4.jar:libs/metrics-core-2.2.0.jar
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.io.tmpdir=/tmp
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.compiler=
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:os.name=Linux
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:os.arch=amd64
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,717 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:os.version=3.2.0-61-virtual
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,717 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Cli

[jira] [Commented] (KAFKA-1374) LogCleaner (compaction) does not support compressed topics

2015-05-12 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539856#comment-14539856
 ] 

Joel Koshy commented on KAFKA-1374:
---

Sorry I dropped this. I just reviewed the patch. I think it looks good, but 
needs a rebase. Let me know if you are swamped though and we can help with it.

> LogCleaner (compaction) does not support compressed topics
> --
>
> Key: KAFKA-1374
> URL: https://issues.apache.org/jira/browse/KAFKA-1374
> Project: Kafka
>  Issue Type: Bug
>Reporter: Joel Koshy
>Assignee: Manikumar Reddy
>  Labels: newbie++
> Fix For: 0.8.3
>
> Attachments: KAFKA-1374.patch, KAFKA-1374_2014-08-09_16:18:55.patch, 
> KAFKA-1374_2014-08-12_22:23:06.patch, KAFKA-1374_2014-09-23_21:47:12.patch, 
> KAFKA-1374_2014-10-03_18:49:16.patch, KAFKA-1374_2014-10-03_19:17:17.patch, 
> KAFKA-1374_2015-01-18_00:19:21.patch
>
>
> This is a known issue, but opening a ticket to track.
> If you try to compact a topic that has compressed messages you will run into
> various exceptions - typically because during iteration we advance the
> position based on the decompressed size of the message. I have a bunch of
> stack traces, but it should be straightforward to reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 24214: Patch for KAFKA-1374

2015-05-12 Thread Joel Koshy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24214/#review83392
---


Sorry for the delay. Overall, this looks good.

As discussed earlier, this patch needs a minor rebase.

There are a couple of points to note:
- In KAFKA-1499 you added broker-side compression. When writing out the 
compacted messages, we should compress using the configured compression codec. 
We can do this as an incremental change if you prefer. i.e., your current patch 
makes the log cleaner compression-aware. A subsequent patch can handle writing 
out to the configured codec. That part could be non-trivial as we would then 
probably want to do some batching when writing out compacted compressed 
messages.
- In KAFKA-1755 I had added some defensive code to prevent compressed messages 
and unkeyed messages from getting in. The compression-related code will need to 
be removed. Again, let me know if you need any help with this.

Let me know if you need help with any of this.


core/src/main/scala/kafka/log/LogCleaner.scala


I would suggest one of two options over this (i.e., instead of two helper 
methods)
- Inline both here and get rid of those
- Have a single private helper (e.g., collectRetainedMessages)



core/src/main/scala/kafka/log/LogCleaner.scala


We should now compress with the compression codec of the topic (KAFKA-1499)



core/src/main/scala/kafka/log/LogCleaner.scala


We should instead do a trivial refactor in ByteBufferMessageSet to compress 
messages in a preallocated buffer. It would be preferable to avoid having this 
compression logic in different places.
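
To make the codec point concrete, a minimal sketch (assuming the 0.8.x ByteBufferMessageSet(codec, messages*) constructor; not the actual patch) of re-wrapping the retained entries with the topic's configured codec. The preallocated-buffer refactor suggested above would move this re-compression into ByteBufferMessageSet itself rather than keeping it in the cleaner.

{code}
import kafka.message.{ByteBufferMessageSet, CompressionCodec, MessageAndOffset}

// Sketch only: after the cleaning pass has decided which entries of a
// compressed wrapper survive, compress them again as one batch using the
// topic's configured codec (per KAFKA-1499) instead of writing them out
// uncompressed.
def rewrapRetained(retained: Seq[MessageAndOffset],
                   topicCodec: CompressionCodec): ByteBufferMessageSet =
  new ByteBufferMessageSet(topicCodec, retained.map(_.message): _*)
{code}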


- Joel Koshy


On Jan. 17, 2015, 6:53 p.m., Manikumar Reddy O wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/24214/
> ---
> 
> (Updated Jan. 17, 2015, 6:53 p.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-1374
> https://issues.apache.org/jira/browse/KAFKA-1374
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> Updating the rebased code
> 
> 
> Diffs
> -
> 
>   core/src/main/scala/kafka/log/LogCleaner.scala 
> f8e7cd5fabce78c248a9027c4bb374a792508675 
>   core/src/main/scala/kafka/tools/TestLogCleaning.scala 
> af496f7c547a5ac7a4096a6af325dad0d8feec6f 
>   core/src/test/scala/unit/kafka/log/LogCleanerIntegrationTest.scala 
> 07acd460b1259e0a3f4069b8b8dcd8123ef5810e 
> 
> Diff: https://reviews.apache.org/r/24214/diff/
> 
> 
> Testing
> ---
> 
> /*TestLogCleaning stress test output for compressed messages*/
> 
> Producing 10 messages...
> Logging produce requests to 
> /tmp/kafka-log-cleaner-produced-6014466306002699464.txt
> Sleeping for 120 seconds...
> Consuming messages...
> Logging consumed messages to 
> /tmp/kafka-log-cleaner-consumed-177538909590644701.txt
> 10 rows of data produced, 13165 rows of data consumed (86.8% reduction).
> De-duplicating and validating output files...
> Validated 9005 values, 0 mismatches.
> 
> Producing 100 messages...
> Logging produce requests to 
> /tmp/kafka-log-cleaner-produced-3298578695475992991.txt
> Sleeping for 120 seconds...
> Consuming messages...
> Logging consumed messages to 
> /tmp/kafka-log-cleaner-consumed-7192293977610206930.txt
> 100 rows of data produced, 119926 rows of data consumed (88.0% reduction).
> De-duplicating and validating output files...
> Validated 89947 values, 0 mismatches.
> 
> Producing 1000 messages...
> Logging produce requests to 
> /tmp/kafka-log-cleaner-produced-3336255463347572934.txt
> Sleeping for 120 seconds...
> Consuming messages...
> Logging consumed messages to 
> /tmp/kafka-log-cleaner-consumed-9149188270705707725.txt
> 1000 rows of data produced, 1645281 rows of data consumed (83.5% 
> reduction).
> De-duplicating and validating output files...
> Validated 899853 values, 0 mismatches.
> 
> 
> /*TestLogCleaning stress test output for non-compressed messages*/
> 
> Producing 10 messages...
> Logging produce requests to 
> /tmp/kafka-log-cleaner-produced-5174543709786189363.txt
> Sleeping for 120 seconds...
> Consuming messages...
> Logging consumed messages to 
> /tmp/kafka-log-cleaner-consumed-514345501144701.txt
> 10 rows of data produced, 22775 rows of data consumed (77.2% reduction).
> De-duplicating and validating output files...
> Validated 17874 values, 0 mismatches.
> 
> Producing 100 messages...
> Logging produce requests to 
> /tmp/kafka-log-cleaner-produced-7814446915546169271.txt
> Sleeping for 120 seconds...
> Consuming messages...
> Logging consumed messages to 
> /tmp/kafka-log-cleaner-consumed-51

[jira] [Updated] (KAFKA-2188) JBOD Support

2015-05-12 Thread Andrii Biletskyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrii Biletskyi updated KAFKA-2188:

Status: Patch Available  (was: Open)

> JBOD Support
> 
>
> Key: KAFKA-2188
> URL: https://issues.apache.org/jira/browse/KAFKA-2188
> Project: Kafka
>  Issue Type: Bug
>Reporter: Andrii Biletskyi
>Assignee: Andrii Biletskyi
> Attachments: KAFKA-2188.patch
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2188) JBOD Support

2015-05-12 Thread Andrii Biletskyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrii Biletskyi updated KAFKA-2188:

Attachment: KAFKA-2188.patch

> JBOD Support
> 
>
> Key: KAFKA-2188
> URL: https://issues.apache.org/jira/browse/KAFKA-2188
> Project: Kafka
>  Issue Type: Bug
>Reporter: Andrii Biletskyi
>Assignee: Andrii Biletskyi
> Attachments: KAFKA-2188.patch
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2188) JBOD Support

2015-05-12 Thread Andrii Biletskyi (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539777#comment-14539777
 ] 

Andrii Biletskyi commented on KAFKA-2188:
-

Created reviewboard https://reviews.apache.org/r/34103/diff/
 against branch origin/trunk

> JBOD Support
> 
>
> Key: KAFKA-2188
> URL: https://issues.apache.org/jira/browse/KAFKA-2188
> Project: Kafka
>  Issue Type: Bug
>Reporter: Andrii Biletskyi
>Assignee: Andrii Biletskyi
> Attachments: KAFKA-2188.patch
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Review Request 34103: Patch for KAFKA-2188

2015-05-12 Thread Andrii Biletskyi

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34103/
---

Review request for kafka.


Bugs: KAFKA-2188
https://issues.apache.org/jira/browse/KAFKA-2188


Repository: kafka


Description
---

KAFKA-2188 - JBOD Support


Diffs
-

  core/src/main/scala/kafka/cluster/Partition.scala 
122b1dbbe45cb27aed79b5be1e735fb617c716b0 
  core/src/main/scala/kafka/common/GenericKafkaStorageException.scala 
PRE-CREATION 
  core/src/main/scala/kafka/controller/KafkaController.scala 
a6351163f5b6f080d6fa50bcc3533d445fcbc067 
  core/src/main/scala/kafka/controller/PartitionLeaderSelector.scala 
3b15ab4eef22c6f50a7483e99a6af40fb55aca9f 
  core/src/main/scala/kafka/log/Log.scala 
84e7b8fe9dd014884b60c4fbe13c835cf02a40e4 
  core/src/main/scala/kafka/log/LogManager.scala 
e781ebac2677ebb22e0c1fef0cf7e5ad57c74ea4 
  core/src/main/scala/kafka/log/LogSegment.scala 
ed039539ac18ea4d65144073915cf112f7374631 
  core/src/main/scala/kafka/server/KafkaApis.scala 
417960dd1ab407ebebad8fdb0e97415db3e91a2f 
  core/src/main/scala/kafka/server/KafkaConfig.scala 
9efa15ca5567b295ab412ee9eea7c03eb4cdc18b 
  core/src/main/scala/kafka/server/OffsetCheckpoint.scala 
8c5b0546908d3b3affb9f48e2ece9ed252518783 
  core/src/main/scala/kafka/server/ReplicaFetcherThread.scala 
b31b432a226ba79546dd22ef1d2acbb439c2e9a3 
  core/src/main/scala/kafka/server/ReplicaManager.scala 
59c9bc3ac3a8afc07a6f8c88c5871304db588d17 
  core/src/main/scala/kafka/utils/ZkUtils.scala 
1da8f90b3a7abda5868186bddf221e31adbe02ce 
  core/src/test/scala/unit/kafka/log/LogManagerTest.scala 
01dfbc4f8d21f6905327cd4ed6c61d657adc0143 
  core/src/test/scala/unit/kafka/server/HighwatermarkPersistenceTest.scala 
60cd8249e6ec03349e20bb0a7226ea9cd66e6b17 
  core/src/test/scala/unit/kafka/utils/TestUtils.scala 
faae0e907596a16c47e8d49a82b6a3c82797c96d 

Diff: https://reviews.apache.org/r/34103/diff/


Testing
---


Thanks,

Andrii Biletskyi



[jira] [Created] (KAFKA-2188) JBOD Support

2015-05-12 Thread Andrii Biletskyi (JIRA)
Andrii Biletskyi created KAFKA-2188:
---

 Summary: JBOD Support
 Key: KAFKA-2188
 URL: https://issues.apache.org/jira/browse/KAFKA-2188
 Project: Kafka
  Issue Type: Bug
Reporter: Andrii Biletskyi
Assignee: Andrii Biletskyi


https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1554) Corrupt index found on clean startup

2015-05-12 Thread Steve Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539621#comment-14539621
 ] 

Steve Miller commented on KAFKA-1554:
-

I just ran into this.  In my case, we'd tried to load some metrics class that 
we hadn't actually installed, so Kafka blew up pretty early in its 
initialization sequence.

If anything causes Kafka to bomb out at startup, it creates all these corrupt 
index files.  I can remove them all, then start things up again, and if Kafka 
crashes (e.g., if I somehow missed removing one of these files, or if in an 
earlier attempt I'd renamed some topic partition directory from foo-5 to 
foo-5.old and Kafka didn't like parsing "5.old" as a number, just to pick an 
example :) ) they're all present again.

Is there some startup-related index-file thing it's doing where it temporarily 
cranks the file size up to 10485760 (or it seeks out to that spot in the file, 
maybe)?

> Corrupt index found on clean startup
> 
>
> Key: KAFKA-1554
> URL: https://issues.apache.org/jira/browse/KAFKA-1554
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1
> Environment: ubuntu 12.04, oracle jdk 1.7
>Reporter: Alexis Midon
>Assignee: Mayuresh Gharat
>Priority: Critical
> Fix For: 0.9.0
>
> Attachments: KAFKA-1554.patch
>
>
> On a clean start up, corrupted index files are found.
> After investigations, it appears that some pre-allocated index files are not 
> "compacted" correctly and the end of the file is full of zeroes.
> As a result, on start up, the last relative offset is zero which yields an 
> offset equal to the base offset.
> The workaround is to delete all index files of size 10MB (the size of the 
> pre-allocated files), and restart. Index files will be re-created.
> {code}
> find $your_data_directory -size 10485760c -name *.index #-delete
> {code}
> This issue might be related/similar to 
> https://issues.apache.org/jira/browse/KAFKA-1112
> {code}
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,696 
> INFO main kafka.server.KafkaServer.info - [Kafka Server 847605514], starting
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,698 
> INFO main kafka.server.KafkaServer.info - [Kafka Server 847605514], 
> Connecting to zookeeper on 
> zk-main0.XXX:2181,zk-main1.XXX:2181,zk-main2.:2181/production/kafka/main
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,708 
> INFO 
> ZkClient-EventThread-14-zk-main0.XXX.com:2181,zk-main1.XXX.com:2181,zk-main2.XXX.com:2181,zk-main3.XXX.com:2181,zk-main4.XXX.com:2181/production/kafka/main
>  org.I0Itec.zkclient.ZkEventThread.run - Starting ZkClient event thread.
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:host.name=i-6b948138.inst.aws.airbnb.com
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,714 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.version=1.7.0_55
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.vendor=Oracle Corporation
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.home=/usr/lib/jvm/jre-7-oracle-x64/jre
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.class.path=libs/snappy-java-1.0.5.jar:libs/scala-library-2.10.1.jar:libs/slf4j-api-1.7.2.jar:libs/jopt-simple-3.2.jar:libs/metrics-annotation-2.2.0.jar:libs/log4j-1.2.15.jar:libs/kafka_2.10-0.8.1.jar:libs/zkclient-0.3.jar:libs/zookeeper-3.3.4.jar:libs/metrics-core-2.2.0.jar
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,715 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.io.tmpdir=/tmp
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.apache.zookeeper.ZooKeeper.logEnv - Client 
> environment:java.compiler=
> 2014-07-11T00:53:17+00:00 i-6b948138 local3.info 2014-07-11 - 00:53:17,716 
> INFO main org.a

[jira] [Commented] (KAFKA-1928) Move kafka.network over to using the network classes in org.apache.kafka.common.network

2015-05-12 Thread Gwen Shapira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539606#comment-14539606
 ] 

Gwen Shapira commented on KAFKA-1928:
-

Updated reviewboard https://reviews.apache.org/r/33065/diff/
 against branch trunk

> Move kafka.network over to using the network classes in 
> org.apache.kafka.common.network
> ---
>
> Key: KAFKA-1928
> URL: https://issues.apache.org/jira/browse/KAFKA-1928
> Project: Kafka
>  Issue Type: Sub-task
>  Components: security
>Reporter: Jay Kreps
>Assignee: Gwen Shapira
> Attachments: KAFKA-1928.patch, KAFKA-1928_2015-04-28_00:09:40.patch, 
> KAFKA-1928_2015-04-30_17:48:33.patch, KAFKA-1928_2015-05-01_15:45:24.patch, 
> KAFKA-1928_2015-05-12_12:00:37.patch, KAFKA-1928_2015-05-12_12:57:57.patch
>
>
> As part of the common package we introduced a bunch of network related code 
> and abstractions.
> We should look into replacing a lot of what is in kafka.network with this 
> code. Duplicate classes include things like Receive, Send, etc. It is likely 
> possible to also refactor the SocketServer to make use of Selector which 
> should significantly simplify its code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-1928) Move kafka.network over to using the network classes in org.apache.kafka.common.network

2015-05-12 Thread Gwen Shapira (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gwen Shapira updated KAFKA-1928:

Attachment: KAFKA-1928_2015-05-12_12:57:57.patch

> Move kafka.network over to using the network classes in 
> org.apache.kafka.common.network
> ---
>
> Key: KAFKA-1928
> URL: https://issues.apache.org/jira/browse/KAFKA-1928
> Project: Kafka
>  Issue Type: Sub-task
>  Components: security
>Reporter: Jay Kreps
>Assignee: Gwen Shapira
> Attachments: KAFKA-1928.patch, KAFKA-1928_2015-04-28_00:09:40.patch, 
> KAFKA-1928_2015-04-30_17:48:33.patch, KAFKA-1928_2015-05-01_15:45:24.patch, 
> KAFKA-1928_2015-05-12_12:00:37.patch, KAFKA-1928_2015-05-12_12:57:57.patch
>
>
> As part of the common package we introduced a bunch of network related code 
> and abstractions.
> We should look into replacing a lot of what is in kafka.network with this 
> code. Duplicate classes include things like Receive, Send, etc. It is likely 
> possible to also refactor the SocketServer to make use of Selector which 
> should significantly simplify its code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 33065: Patch for KAFKA-1928

2015-05-12 Thread Gwen Shapira

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33065/
---

(Updated May 12, 2015, 9:58 a.m.)


Review request for kafka.


Bugs: 1928 and KAFKA-1928
https://issues.apache.org/jira/browse/1928
https://issues.apache.org/jira/browse/KAFKA-1928


Repository: kafka


Description (updated)
---

first pass on replacing Send


implement maxSize and improved docs


Merge branch 'trunk' of http://git-wip-us.apache.org/repos/asf/kafka into 
KAFKA-1928-v2


Merge branch 'trunk' of http://git-wip-us.apache.org/repos/asf/kafka into 
KAFKA-1928-v2

Conflicts:
core/src/main/scala/kafka/network/RequestChannel.scala

moved selector out of abstract thread


mid-way through putting selector in SocketServer


Merge branch 'trunk' of http://git-wip-us.apache.org/repos/asf/kafka into 
KAFKA-1928-v2

Also, SocketServer is now using Selector. Still a bit messy - but all tests pass.

Merge branch 'trunk' of http://git-wip-us.apache.org/repos/asf/kafka into 
KAFKA-1928-v2


renamed requestKey to connectionId to reflect new use and changed type from Any 
to String


Following Jun's comments - moved MultiSend to client. Cleaned up destinations 
as well


removed reify and remaining from send/receive API, per Jun. moved 
maybeCloseOldest() to Selector per Jay


added idString to node API, changed written to int in Send API


Diffs (updated)
-

  clients/src/main/java/org/apache/kafka/clients/ClusterConnectionStates.java 
da76cc257b4cfe3c4bce7120a1f14c7f31ef8587 
  clients/src/main/java/org/apache/kafka/clients/InFlightRequests.java 
936487b16e7ac566f8bdcd39a7240ceb619fd30e 
  clients/src/main/java/org/apache/kafka/clients/KafkaClient.java 
1311f85847b022efec8cb05c450bb18231db6979 
  clients/src/main/java/org/apache/kafka/clients/NetworkClient.java 
b7ae595f2cc46e5dfe728bc3ce6082e9cd0b6d36 
  clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java 
bdff518b732105823058e6182f445248b45dc388 
  clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java 
d301be4709f7b112e1f3a39f3c04cfa65f00fa60 
  
clients/src/main/java/org/apache/kafka/clients/consumer/internals/Coordinator.java
 e55ab11df4db0b0084f841a74cbcf819caf780d5 
  
clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java 
ef9dd5238fbc771496029866ece1d85db6d7b7a5 
  clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java 
42b12928781463b56fc4a45d96bb4da2745b6d95 
  clients/src/main/java/org/apache/kafka/clients/producer/ProducerConfig.java 
187d0004c8c46b6664ddaffecc6166d4b47351e5 
  clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java 
b2db91ca14bbd17fef5ce85839679144fff3f689 
  clients/src/main/java/org/apache/kafka/common/Node.java 
f4e4186c7602787e58e304a2f1c293a633114656 
  clients/src/main/java/org/apache/kafka/common/network/ByteBufferReceive.java 
129ae827bccbd982ad93d56e46c6f5c46f147fe0 
  clients/src/main/java/org/apache/kafka/common/network/ByteBufferSend.java 
c8213e156ec9c9af49ee09f5238492318516aaa3 
  clients/src/main/java/org/apache/kafka/common/network/MultiSend.java 
PRE-CREATION 
  clients/src/main/java/org/apache/kafka/common/network/NetworkReceive.java 
fc0d168324aaebb97065b0aafbd547a1994d76a7 
  clients/src/main/java/org/apache/kafka/common/network/NetworkSend.java 
68327cd3a734fd429966d3e2016a2488dbbb19e5 
  clients/src/main/java/org/apache/kafka/common/network/Receive.java 
4e33078c1eec834bd74aabcb5fc69f18c9d6d52a 
  clients/src/main/java/org/apache/kafka/common/network/Selectable.java 
b5f8d83e89f9026dc0853e5f92c00b2d7f043e22 
  clients/src/main/java/org/apache/kafka/common/network/Selector.java 
57de0585e5e9a53eb9dcd99cac1ab3eb2086a302 
  clients/src/main/java/org/apache/kafka/common/network/Send.java 
5d321a09e470166a1c33639cf0cab26a3bce98ec 
  clients/src/main/java/org/apache/kafka/common/requests/RequestSend.java 
27cbf390c7f148ffa8c5abc154c72cbf0829715c 
  clients/src/main/java/org/apache/kafka/common/requests/ResponseSend.java 
PRE-CREATION 
  clients/src/test/java/org/apache/kafka/clients/MockClient.java 
5e3fab13e3c02eb351558ec973b949b3d1196085 
  clients/src/test/java/org/apache/kafka/clients/NetworkClientTest.java 
8b278892883e63899b53e15efb9d8c926131e858 
  clients/src/test/java/org/apache/kafka/common/network/SelectorTest.java 
d5b306b026e788b4e5479f3419805aa49ae889f3 
  clients/src/test/java/org/apache/kafka/test/MockSelector.java 
ea89b06a4c9e5bb351201299cd3037f5226f0e6c 
  core/src/main/scala/kafka/admin/ConsumerGroupCommand.scala 
1c3b3802ac221d570e7610458e50518b4499e7ed 
  core/src/main/scala/kafka/api/ConsumerMetadataRequest.scala 
a3b1b78adb760eaeb029466b54f335a29caf3b0f 
  core/src/main/scala/kafka/api/ControlledShutdownRequest.scala 
fe81635c864cec03ca1d4681c9c47c3fc4f975ee 
  core/src/main/scala/kafka/api/FetchRequest.scala 
b038c15186c0cbcc65b59479324052498
