orders of launching kafka servers and zookeepers

2013-05-22 Thread Yu, Libo
Hi,

I want to launch Kafka on three machines. One option is to launch ZooKeeper
on all three machines first and, after that, start the Kafka server on each
machine. The other is, for each machine, to start ZooKeeper followed by Kafka.
I believe the first way is the right one, but I want to confirm it.


Regards,

Libo



Re: orders of launching kafka servers and zookeepers

2013-05-22 Thread Neha Narkhede
First launch the ZooKeeper cluster completely, then start the Kafka cluster.
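
Concretely, assuming the stock scripts and config files that ship with the
Kafka distribution (and a zookeeper.properties already set up for the
three-node ensemble on each machine), the order would roughly be:

  # on each of the three machines, start ZooKeeper first
  bin/zookeeper-server-start.sh config/zookeeper.properties

  # once the ZooKeeper ensemble is up on all three machines, start each broker
  bin/kafka-server-start.sh config/server.properties

The brokers register themselves in ZooKeeper on startup, which is why the
ensemble should be fully up before the first broker starts.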

Thanks,
Neha
On May 22, 2013 8:43 AM, Yu, Libo libo...@citi.com wrote:

 Hi,

 I want to launch kafka on three machines. I can launch zookeepers
 on the three machines first. After that, start kafka server on each
 machine. Or for each machine, I start a zookeeper followed by the kafka.
 I believe the first way is the right way to go. But I want to confirm it.


 Regards,

 Libo




Partitioning and scale

2013-05-22 Thread Timothy Chen
Hi,

I'm currently trying to understand how Kafka (0.8) can scale with our usage
pattern and how to set up the partitioning.

We want to route messages belonging to the same id to the same
queue, so that its consumer will be able to consume all the messages for that id.

My questions:

 - From my understanding, in Kafka we would need a custom partitioner that
routes messages with the same id to the same partition, right? I'm trying to
find examples of writing this partitioner logic, but I can't find any. Can
someone point me to an example?

- I see that Kafka's server.properties allows one to specify the number of
partitions it supports. However, when we want to scale, I wonder: if we add
partitions or brokers, will the same partitioner start distributing the
messages to different partitions?
 And if it does, how can the same consumer continue to read off the
messages for those ids if it was interrupted in the middle?

- I'd like to create a consumer per partition, and have each one
subscribe to the changes of that partition. How can this be done in Kafka?

Thanks,

Tim


Re: Partitioning and scale

2013-05-22 Thread Chris Curtin
Hi Tim,


On Wed, May 22, 2013 at 3:25 PM, Timothy Chen tnac...@gmail.com wrote:

 Hi,

 I'm currently trying to understand how Kafka (0.8) can scale with our usage
 pattern and how to setup the partitioning.

 We want to route the same messages belonging to the same id to the same
 queue, so its consumer will able to consume all the messages of that id.

 My questions:

  - From my understanding, in Kafka we would need to have a custom
 partitioner that routes the same messages to the same partition right?  I'm
 trying to find examples of writing this partitioner logic, but I can't find
 any. Can someone point me to an example?

 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example

The partitioner here does a simple mod on the IP address and the # of
partitions. You'd need to define your own logic, but this is a start.
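
A minimal sketch of such a partitioner against the 0.8 producer API (the class
name and the id-hashing logic here are illustrations, not the wiki's IP-based
version; it assumes the non-generic Partitioner interface of the 0.8.0 release,
which is handed a VerifiableProperties in its constructor):

  import kafka.producer.Partitioner;
  import kafka.utils.VerifiableProperties;

  public class IdPartitioner implements Partitioner {
      // Kafka instantiates the partitioner reflectively and passes the
      // producer's properties to this constructor.
      public IdPartitioner(VerifiableProperties props) { }

      // Messages carrying the same key (the id) always map to the same
      // partition, as long as the partition count does not change.
      public int partition(Object key, int numPartitions) {
          return (key.hashCode() & 0x7fffffff) % numPartitions;
      }
  }

You plug it in via the partitioner.class producer property.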


 - I see that Kafka server.properties allows one to specify the number of
 partitions it supports. However, when we want to scale I wonder if we add #
 of partitions or # of brokers, will the same partitioner start distributing
 the messages to different partitions?
  And if it does, how can that same consumer continue to read off the
 messages of those ids if it was interrupted in the middle?


I'll let someone else answer this.



 - I'd like to create a consumer per partition, and for each one to
 subscribe to the changes of that one. How can this be done in kafka?


Two ways: SimpleConsumer or consumer groups.

It depends on the level of control you want over which code processes a
specific partition vs. having one assigned to it (and the level of control
over offset management).

https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example

https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
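
For the consumer-group route, a rough sketch of the high-level consumer from
the first link (the topic name, group id and ZooKeeper quorum below are
placeholders; in practice you would hand each stream to its own thread):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Properties;
  import kafka.consumer.Consumer;
  import kafka.consumer.ConsumerConfig;
  import kafka.consumer.ConsumerIterator;
  import kafka.consumer.KafkaStream;
  import kafka.javaapi.consumer.ConsumerConnector;

  public class GroupConsumerSketch {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181"); // placeholder quorum
          props.put("group.id", "my-group");                            // placeholder group id
          ConsumerConnector consumer =
              Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

          // ask for one stream for the topic; Kafka assigns partitions to streams
          Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
          topicCountMap.put("my-topic", 1);
          Map<String, List<KafkaStream<byte[], byte[]>>> streams =
              consumer.createMessageStreams(topicCountMap);

          ConsumerIterator<byte[], byte[]> it = streams.get("my-topic").get(0).iterator();
          while (it.hasNext()) {
              byte[] payload = it.next().message();   // process the message here
          }
      }
  }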



 Thanks,

 Tim



Re: Partitioning and scale

2013-05-22 Thread Neha Narkhede
- I see that Kafka server.properties allows one to specify the number of
partitions it supports. However, when we want to scale I wonder if we add #
of partitions or # of brokers, will the same partitioner start distributing
the messages to different partitions?
 And if it does, how can that same consumer continue to read off the
messages of those ids if it was interrupted in the middle?

The num.partitions config in server.properties is used only for topics that
are auto created (controlled by auto.create.topics.enable). For topics that
you create using the admin tool, you can specify the number of partitions
that you want. After that, currently there is no way to change that. For
that reason, it is a good idea to over partition your topic, which also
helps load balance partitions onto the brokers. You are right that if you
change the number of partitions later, then messages that previously stuck
to a certain partition could get routed to a different partition, which
is undesirable for applications that want to use sticky partitioning.
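
To keep an id sticky, the producer sends the id as the message key, so the
partitioner always maps it to the same partition. A minimal sketch against the
0.8 producer API (broker list, topic and the partitioner class name are
placeholders):

  import java.util.Properties;
  import kafka.javaapi.producer.Producer;
  import kafka.producer.KeyedMessage;
  import kafka.producer.ProducerConfig;

  public class KeyedProducerSketch {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // placeholder brokers
          props.put("serializer.class", "kafka.serializer.StringEncoder");
          props.put("partitioner.class", "example.IdPartitioner");        // your custom partitioner, if any
          props.put("request.required.acks", "1");

          Producer<String, String> producer =
              new Producer<String, String>(new ProducerConfig(props));
          // same key ("id-42") => same partition, for a fixed partition count
          producer.send(new KeyedMessage<String, String>("my-topic", "id-42", "payload"));
          producer.close();
      }
  }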

- I'd like to create a consumer per partition, and for each one to
subscribe to the changes of that one. How can this be done in kafka?

For your use case, it seems like SimpleConsumer might be a better fit.
However, it will require you to write code to handle discovery of leader
for the partition that your consumer is consuming. Chris has written up a
great example that you can follow -
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example

Thanks,
Neha


On Wed, May 22, 2013 at 12:37 PM, Chris Curtin curtin.ch...@gmail.com wrote:

 Hi Tim,


 On Wed, May 22, 2013 at 3:25 PM, Timothy Chen tnac...@gmail.com wrote:

  Hi,
 
  I'm currently trying to understand how Kafka (0.8) can scale with our
 usage
  pattern and how to setup the partitioning.
 
  We want to route the same messages belonging to the same id to the same
  queue, so its consumer will able to consume all the messages of that id.
 
  My questions:
 
   - From my understanding, in Kafka we would need to have a custom
  partitioner that routes the same messages to the same partition right?
  I'm
  trying to find examples of writing this partitioner logic, but I can't
 find
  any. Can someone point me to an example?
 
  https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example

 The partitioner here does a simple mod on the IP address and the # of
 partitions. You'd need to define your own logic, but this is a start.


  - I see that Kafka server.properties allows one to specify the number of
  partitions it supports. However, when we want to scale I wonder if we
 add #
  of partitions or # of brokers, will the same partitioner start
 distributing
  the messages to different partitions?
   And if it does, how can that same consumer continue to read off the
  messages of those ids if it was interrupted in the middle?
 

 I'll let someone else answer this.


 
  - I'd like to create a consumer per partition, and for each one to
  subscribe to the changes of that one. How can this be done in kafka?
 

 Two ways: Simple Consumer or Consumer Groups:

 Depends on the level of control you want on code processing a specific
 partition vs. getting one assigned to it (and level of control over offset
 management).

 https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example


 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
 https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example


 
  Thanks,
 
  Tim
 



Apache Kafka in AWS

2013-05-22 Thread Jason Weiss
All,

I asked a number of questions of the group over the last week, and I'm happy to 
report that I've had great success getting Kafka up and running in AWS. I am 
using 3 EC2 instances, each of which is a M2 High-Memory Quadruple Extra Large 
with 8 cores and 58.4 GiB of memory according to the AWS specs. I have 
co-located ZooKeeper instances next to Kafka on each machine.

I am able to publish in a repeatable fashion 273,000 events per second, with 
each event payload consisting of a fixed size of 2048 bytes! This represents 
the maximum throughput possible on this configuration, as the servers became 
CPU constrained, averaging 97% utilization in a relatively flat line. This 
isn't a burst speed – it represents a sustained throughput from 20 M1 Large 
EC2 Kafka multi-threaded producers. Putting this into perspective, if my log 
retention period was a month, I'd be aggregating 1.3 petabytes of data on my 
disk drives. Suffice to say, I don't see us retaining data for more than a few 
hours!

Here were the keys to tuning for future folks to consider:

First and foremost, be sure to configure your Java heap size accordingly when
you launch Kafka. The default is like 512MB, which in my case left virtually
all of my RAM inaccessible to Kafka.

Second, stay away from OpenJDK. No, seriously – this was a huge thorn in my
side, and I almost gave up on Kafka because of the problems I encountered. The
OpenJDK NIO functions repeatedly resulted in Kafka crashing and burning in
dramatic fashion. The moment I switched over to Oracle's JDK for Linux, Kafka
didn't puke once - I mean, like not even a hiccup.

Third, know your message size. In my opinion, the more you understand about your
event payload characteristics, the better you can tune the system. The two
knobs to really turn are log.flush.interval and log.default.flush.interval.ms.
The values here are intrinsically connected to the types of payloads you are
putting through the system.

Fourth and finally, to maximize throughput you have to code against the async
paradigm, and be prepared to tweak the batch size, queue properties, and
compression codec (wait for it…) in a way that matches the message payload you
are putting through the system and the capabilities of the producer system
itself.
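
As a rough illustration of the knobs mentioned above (0.7-era property names;
the numbers are placeholders to tune against your own payloads, not
recommendations):

  # broker side (server.properties): flush a log after this many messages,
  # or after this much time, whichever comes first
  log.flush.interval=10000
  log.default.flush.interval.ms=1000

  # producer side: async batching and compression
  producer.type=async
  batch.size=200
  queue.size=10000
  # 0 = no compression, 1 = gzip
  compression.codec=1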


Jason







Re: Apache Kafka in AWS

2013-05-22 Thread Neha Narkhede
Thanks for sharing your experience with the community, Jason!

-Neha


On Wed, May 22, 2013 at 1:42 PM, Jason Weiss jason_we...@rapid7.com wrote:

 All,

 I asked a number of questions of the group over the last week, and I'm
 happy to report that I've had great success getting Kafka up and running in
 AWS. I am using 3 EC2 instances, each of which is a M2 High-Memory
 Quadruple Extra Large with 8 cores and 58.4 GiB of memory according to the
 AWS specs. I have co-located Zookeeper instances next to Zafka on each
 machine.

 I am able to publish in a repeatable fashion 273,000 events per second,
 with each event payload consisting of a fixed size of 2048 bytes! This
 represents the maximum throughput possible on this configuration, as the
 servers became CPU constrained, averaging 97% utilization in a relatively
 flat line. This isn't a burst speed – it represents a sustained
 throughput from 20 M1 Large EC2 Kafka multi-threaded producers. Putting
 this into perspective, if my log retention period was a month, I'd be
 aggregating 1.3 petabytes of data on my disk drives. Suffice to say, I
 don't see us retaining data for more than a few hours!

 Here were the keys to tuning for future folks to consider:

 First and foremost, be sure to configure your Java heap size accordingly
 when you launch Kafka. The default is like 512MB, which in my case left
 virtually all of my RAM inaccessible to Kafka.
 Second, stay away from OpenJDK. No, seriously – this was a huge thorn in
 my side, and I almost gave up on Kafka because of the problems I
 encountered. The OpenJDK NIO functions repeatedly resulted in Kafka
 crashing and burning in dramatic fashion. The moment I switched over to
 Oracle's JDK for linux, Kafka didn't puke once- I mean, like not even a
 hiccup.
 Third know your message size. In my opinion, the more you understand about
 your event payload characteristics, the better you can tune the system. The
 two knobs to really turn are the log.flush.interval and
 log.default.flush.interval.ms. The values here are intrinsically
 connected to the types of payloads you are putting through the system.
 Fourth and finally, to maximize throughput you have to code against the
 async paradigm, and be prepared to tweak the batch size, queue properties,
 and compression codec (wait for it…) in a way that matches the message
 payload you are putting through the system and the capabilities of the
 producer system itself.


 Jason








Re: message ordering guarantees

2013-05-22 Thread Neha Narkhede
Thanks,
Neha
On May 21, 2013 5:42 PM, Ross Black ross.w.bl...@gmail.com wrote:

 Hi,

 I am using Kafka 0.7.1, and using SyncProducer and SimpleConsumer with a
 single broker service process.

 I am occasionally seeing messages (from a *single* partition) being
 processed out of order to what I expect and I am trying to find where the
 problem lies.  The problem may well be in my code - I just would like to
 eliminate Kafka as a potential cause.

 Messages are being sent sequentially from the producer process, using a
 single SyncProducer.
 Does Kafka provide any guarantees for message ordering in this case?

 e.g. If the sync producer sends messages A then B then C, does the Kafka
 broker guarantee that messages will be persisted with the order A,B,C?
 If not, is there any way to ensure this ordering?
 Has anything changed in 0.8 that could be used to ensure that ordering?


 Thanks,
 Ross



Re: Partitioning and scale

2013-05-22 Thread Timothy Chen
Hi Neha/Chris,

Thanks for the reply, so if I set a fixed number of partitions and just add
brokers to the broker pool, does it rebalance the load to the new brokers
(along with the data)?

Tim


On Wed, May 22, 2013 at 1:15 PM, Neha Narkhede neha.narkh...@gmail.com wrote:

 - I see that Kafka server.properties allows one to specify the number of
 partitions it supports. However, when we want to scale I wonder if we add #
 of partitions or # of brokers, will the same partitioner start distributing
 the messages to different partitions?
  And if it does, how can that same consumer continue to read off the
 messages of those ids if it was interrupted in the middle?

 The num.partitions config in server.properties is used only for topics that
 are auto created (controlled by auto.create.topics.enable). For topics that
 you create using the admin tool, you can specify the number of partitions
 that you want. After that, currently there is no way to change that. For
 that reason, it is a good idea to over partition your topic, which also
 helps load balance partitions onto the brokers. You are right that if you
 change the number of partitions later, then previously messages that stuck
 to a certain partition would now get routed to a different partition, which
 is undesirable for applications that want to use sticky partitioning.

 - I'd like to create a consumer per partition, and for each one to
 subscribe to the changes of that one. How can this be done in kafka?

 For your use case, it seems like SimpleConsumer might be a better fit.
 However, it will require you to write code to handle discovery of leader
 for the partition that your consumer is consuming. Chris has written up a
 great example that you can follow -

 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example

 Thanks,
 Neha


 On Wed, May 22, 2013 at 12:37 PM, Chris Curtin curtin.ch...@gmail.com
 wrote:

  Hi Tim,
 
 
  On Wed, May 22, 2013 at 3:25 PM, Timothy Chen tnac...@gmail.com wrote:
 
   Hi,
  
   I'm currently trying to understand how Kafka (0.8) can scale with our
  usage
   pattern and how to setup the partitioning.
  
   We want to route the same messages belonging to the same id to the same
   queue, so its consumer will able to consume all the messages of that
 id.
  
   My questions:
  
- From my understanding, in Kafka we would need to have a custom
   partitioner that routes the same messages to the same partition right?
   I'm
   trying to find examples of writing this partitioner logic, but I can't
  find
   any. Can someone point me to an example?
  
  
 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example
 
  The partitioner here does a simple mod on the IP address and the # of
  partitions. You'd need to define your own logic, but this is a start.
 
 
   - I see that Kafka server.properties allows one to specify the number
 of
   partitions it supports. However, when we want to scale I wonder if we
  add #
   of partitions or # of brokers, will the same partitioner start
  distributing
   the messages to different partitions?
And if it does, how can that same consumer continue to read off the
   messages of those ids if it was interrupted in the middle?
  
 
  I'll let someone else answer this.
 
 
  
   - I'd like to create a consumer per partition, and for each one to
   subscribe to the changes of that one. How can this be done in kafka?
  
 
  Two ways: Simple Consumer or Consumer Groups:
 
  Depends on the level of control you want on code processing a specific
  partition vs. getting one assigned to it (and level of control over
 offset
  management).
 
  https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
 
 
 
 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
  
 https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
 
 
  
   Thanks,
  
   Tim
  
 



Re: Apache Kafka in AWS

2013-05-22 Thread Ken Krugler
Hi Jason,

Thanks for the notes.

I'm curious whether you went with using local drives (ephemeral storage) or 
EBS, and if with EBS then what IOPS.

Thanks,

-- Ken

On May 22, 2013, at 1:42pm, Jason Weiss wrote:

 All,
 
 I asked a number of questions of the group over the last week, and I'm happy 
 to report that I've had great success getting Kafka up and running in AWS. I 
 am using 3 EC2 instances, each of which is a M2 High-Memory Quadruple Extra 
 Large with 8 cores and 58.4 GiB of memory according to the AWS specs. I have 
 co-located Zookeeper instances next to Zafka on each machine.
 
 I am able to publish in a repeatable fashion 273,000 events per second, with 
 each event payload consisting of a fixed size of 2048 bytes! This represents 
 the maximum throughput possible on this configuration, as the servers became 
 CPU constrained, averaging 97% utilization in a relatively flat line. This 
 isn't a burst speed – it represents a sustained throughput from 20 M1 Large 
 EC2 Kafka multi-threaded producers. Putting this into perspective, if my log 
 retention period was a month, I'd be aggregating 1.3 petabytes of data on my 
 disk drives. Suffice to say, I don't see us retaining data for more than a 
 few hours!
 
 Here were the keys to tuning for future folks to consider:
 
 First and foremost, be sure to configure your Java heap size accordingly when 
 you launch Kafka. The default is like 512MB, which in my case left virtually 
 all of my RAM inaccessible to Kafka.
 Second, stay away from OpenJDK. No, seriously – this was a huge thorn in my 
 side, and I almost gave up on Kafka because of the problems I encountered. 
 The OpenJDK NIO functions repeatedly resulted in Kafka crashing and burning 
 in dramatic fashion. The moment I switched over to Oracle's JDK for linux, 
 Kafka didn't puke once- I mean, like not even a hiccup.
 Third know your message size. In my opinion, the more you understand about 
 your event payload characteristics, the better you can tune the system. The 
 two knobs to really turn are the log.flush.interval and 
 log.default.flush.interval.ms. The values here are intrinsically connected to 
 the types of payloads you are putting through the system.
 Fourth and finally, to maximize throughput you have to code against the async 
 paradigm, and be prepared to tweak the batch size, queue properties, and 
 compression codec (wait for it…) in a way that matches the message payload 
 you are putting through the system and the capabilities of the producer 
 system itself.
 
 
 Jason
 
 
 
 
 

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







RE: Apache Kafka in AWS

2013-05-22 Thread Jason Weiss
Ken,

Great question! I should have indicated I was using EBS, 500GB with 2000 
provisioned IOPS.

Jason


From: Ken Krugler [kkrugler_li...@transpac.com]
Sent: Wednesday, May 22, 2013 17:23
To: users@kafka.apache.org
Subject: Re: Apache Kafka in AWS

Hi Jason,

Thanks for the notes.

I'm curious whether you went with using local drives (ephemeral storage) or 
EBS, and if with EBS then what IOPS.

Thanks,

-- Ken

On May 22, 2013, at 1:42pm, Jason Weiss wrote:

 All,

 I asked a number of questions of the group over the last week, and I'm happy 
 to report that I've had great success getting Kafka up and running in AWS. I 
 am using 3 EC2 instances, each of which is a M2 High-Memory Quadruple Extra 
 Large with 8 cores and 58.4 GiB of memory according to the AWS specs. I have 
 co-located Zookeeper instances next to Zafka on each machine.

 I am able to publish in a repeatable fashion 273,000 events per second, with 
 each event payload consisting of a fixed size of 2048 bytes! This represents 
 the maximum throughput possible on this configuration, as the servers became 
 CPU constrained, averaging 97% utilization in a relatively flat line. This 
 isn't a burst speed – it represents a sustained throughput from 20 M1 Large 
 EC2 Kafka multi-threaded producers. Putting this into perspective, if my log 
 retention period was a month, I'd be aggregating 1.3 petabytes of data on my 
 disk drives. Suffice to say, I don't see us retaining data for more than a 
 few hours!

 Here were the keys to tuning for future folks to consider:

 First and foremost, be sure to configure your Java heap size accordingly when 
 you launch Kafka. The default is like 512MB, which in my case left virtually 
 all of my RAM inaccessible to Kafka.
 Second, stay away from OpenJDK. No, seriously – this was a huge thorn in my 
 side, and I almost gave up on Kafka because of the problems I encountered. 
 The OpenJDK NIO functions repeatedly resulted in Kafka crashing and burning 
 in dramatic fashion. The moment I switched over to Oracle's JDK for linux, 
 Kafka didn't puke once- I mean, like not even a hiccup.
 Third know your message size. In my opinion, the more you understand about 
 your event payload characteristics, the better you can tune the system. The 
 two knobs to really turn are the log.flush.interval and 
 log.default.flush.interval.ms. The values here are intrinsically connected to 
 the types of payloads you are putting through the system.
 Fourth and finally, to maximize throughput you have to code against the async 
 paradigm, and be prepared to tweak the batch size, queue properties, and 
 compression codec (wait for it…) in a way that matches the message payload 
 you are putting through the system and the capabilities of the producer 
 system itself.


 Jason






--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr








Re: Apache Kafka in AWS

2013-05-22 Thread Jonathan Hodges
Awesome write-up, Jason!  Very helpful as we are also looking to build a
Kafka environment in AWS.  I am curious, are you using Kafka 0.7.2 or 0.8
in your tests?  Did you have just one EBS volume per broker instance or
RAID 10 across EBS volumes per broker?

Thanks again for the great info!

-Jonathan


On Wed, May 22, 2013 at 4:35 PM, Jason Weiss jason_we...@rapid7.com wrote:

 Ken,

 Great question! I should have indicated I was using EBS, 500GB with 2000
 provisioned IOPs.

 Jason

 
 From: Ken Krugler [kkrugler_li...@transpac.com]
 Sent: Wednesday, May 22, 2013 17:23
 To: users@kafka.apache.org
 Subject: Re: Apache Kafka in AWS

 Hi Jason,

 Thanks for the notes.

 I'm curious whether you went with using local drives (ephemeral storage)
 or EBS, and if with EBS then what IOPS.

 Thanks,

 -- Ken

 On May 22, 2013, at 1:42pm, Jason Weiss wrote:

  All,
 
  I asked a number of questions of the group over the last week, and I'm
 happy to report that I've had great success getting Kafka up and running in
 AWS. I am using 3 EC2 instances, each of which is a M2 High-Memory
 Quadruple Extra Large with 8 cores and 58.4 GiB of memory according to the
 AWS specs. I have co-located Zookeeper instances next to Zafka on each
 machine.
 
  I am able to publish in a repeatable fashion 273,000 events per second,
 with each event payload consisting of a fixed size of 2048 bytes! This
 represents the maximum throughput possible on this configuration, as the
 servers became CPU constrained, averaging 97% utilization in a relatively
 flat line. This isn't a burst speed – it represents a sustained
 throughput from 20 M1 Large EC2 Kafka multi-threaded producers. Putting
 this into perspective, if my log retention period was a month, I'd be
 aggregating 1.3 petabytes of data on my disk drives. Suffice to say, I
 don't see us retaining data for more than a few hours!
 
  Here were the keys to tuning for future folks to consider:
 
  First and foremost, be sure to configure your Java heap size accordingly
 when you launch Kafka. The default is like 512MB, which in my case left
 virtually all of my RAM inaccessible to Kafka.
  Second, stay away from OpenJDK. No, seriously – this was a huge thorn in
 my side, and I almost gave up on Kafka because of the problems I
 encountered. The OpenJDK NIO functions repeatedly resulted in Kafka
 crashing and burning in dramatic fashion. The moment I switched over to
 Oracle's JDK for linux, Kafka didn't puke once- I mean, like not even a
 hiccup.
  Third know your message size. In my opinion, the more you understand
 about your event payload characteristics, the better you can tune the
 system. The two knobs to really turn are the log.flush.interval and
 log.default.flush.interval.ms. The values here are intrinsically
 connected to the types of payloads you are putting through the system.
  Fourth and finally, to maximize throughput you have to code against the
 async paradigm, and be prepared to tweak the batch size, queue properties,
 and compression codec (wait for it…) in a way that matches the message
 payload you are putting through the system and the capabilities of the
 producer system itself.
 
 
  Jason
 
 
 
 
 

 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr









RE: Apache Kafka in AWS

2013-05-22 Thread Jason Weiss
Jonathan,

Using 0.7.2, with just a single EBS volume per broker instance - negative on 
the RAID 10.

I would speculate that if we used RAID 10 and we went with AWS's maximum 
provisioned IOPS (5000??) we probably could have squeaked out some more eps.

I have no doubt, BTW, that if we had implemented this on bare metal, the
numbers would have been substantially higher. For example, the variation
between the 20 identical producer clients was rather dramatic - as much as
5000 eps in some cases. Given that these were identical virtualized instances,
running identical software, configured identically from a single AWS AMI, the
only explanation is that the performance difference came from the tax of
virtualization.


Jason



From: Jonathan Hodges [hodg...@gmail.com]
Sent: Wednesday, May 22, 2013 19:11
To: users@kafka.apache.org
Subject: Re: Apache Kafka in AWS

Awesome right up Jason!  Very helpful as we are also looking to build a
Kafka environment in AWS.  I am curious, are you using Kafka 0.7.2 or 0.8
in your tests?  Did you have just one EBS volume per broker instance or
RAID 10 across EBS volumes per broker?

Thanks again for the great info!

-Jonathan


On Wed, May 22, 2013 at 4:35 PM, Jason Weiss jason_we...@rapid7.com wrote:

 Ken,

 Great question! I should have indicated I was using EBS, 500GB with 2000
 provisioned IOPs.

 Jason

 
 From: Ken Krugler [kkrugler_li...@transpac.com]
 Sent: Wednesday, May 22, 2013 17:23
 To: users@kafka.apache.org
 Subject: Re: Apache Kafka in AWS

 Hi Jason,

 Thanks for the notes.

 I'm curious whether you went with using local drives (ephemeral storage)
 or EBS, and if with EBS then what IOPS.

 Thanks,

 -- Ken

 On May 22, 2013, at 1:42pm, Jason Weiss wrote:

  All,
 
  I asked a number of questions of the group over the last week, and I'm
 happy to report that I've had great success getting Kafka up and running in
 AWS. I am using 3 EC2 instances, each of which is a M2 High-Memory
 Quadruple Extra Large with 8 cores and 58.4 GiB of memory according to the
 AWS specs. I have co-located Zookeeper instances next to Zafka on each
 machine.
 
  I am able to publish in a repeatable fashion 273,000 events per second,
 with each event payload consisting of a fixed size of 2048 bytes! This
 represents the maximum throughput possible on this configuration, as the
 servers became CPU constrained, averaging 97% utilization in a relatively
 flat line. This isn't a burst speed – it represents a sustained
 throughput from 20 M1 Large EC2 Kafka multi-threaded producers. Putting
 this into perspective, if my log retention period was a month, I'd be
 aggregating 1.3 petabytes of data on my disk drives. Suffice to say, I
 don't see us retaining data for more than a few hours!
 
  Here were the keys to tuning for future folks to consider:
 
  First and foremost, be sure to configure your Java heap size accordingly
 when you launch Kafka. The default is like 512MB, which in my case left
 virtually all of my RAM inaccessible to Kafka.
  Second, stay away from OpenJDK. No, seriously – this was a huge thorn in
 my side, and I almost gave up on Kafka because of the problems I
 encountered. The OpenJDK NIO functions repeatedly resulted in Kafka
 crashing and burning in dramatic fashion. The moment I switched over to
 Oracle's JDK for linux, Kafka didn't puke once- I mean, like not even a
 hiccup.
  Third know your message size. In my opinion, the more you understand
 about your event payload characteristics, the better you can tune the
 system. The two knobs to really turn are the log.flush.interval and
 log.default.flush.interval.ms. The values here are intrinsically
 connected to the types of payloads you are putting through the system.
  Fourth and finally, to maximize throughput you have to code against the
 async paradigm, and be prepared to tweak the batch size, queue properties,
 and compression codec (wait for it…) in a way that matches the message
 payload you are putting through the system and the capabilities of the
 producer system itself.
 
 
  Jason
 
 
 
 
 

 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr






Re: Offset in high level consumer

2013-05-22 Thread Neha Narkhede
You can run the ConsumerOffsetChecker tool that ships with Kafka.
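
For example (the group id and ZooKeeper connect string are placeholders, and
the exact option names can differ slightly between releases):

  bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
    --group my-group --zkconnect zk1:2181

It prints, per partition, the consumed offset, the log end offset and the lag
for the given consumer group.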

Thanks,
Neha


On Wed, May 22, 2013 at 2:02 PM, arathi maddula arathimadd...@gmail.com wrote:

 Hi,

 Could you tell me how to find the offset in a high-level Java consumer?

 Thanks
 Arathi



Re: Apache Kafka in AWS

2013-05-22 Thread Scott Clasen
Hey Jason,

 Question: what OpenJDK version did you have issues with? I'm running Kafka
on it now and it has been OK. Was it a crash only under load?

Thanks
SC


On Wed, May 22, 2013 at 1:42 PM, Jason Weiss jason_we...@rapid7.com wrote:

 All,

 I asked a number of questions of the group over the last week, and I'm
 happy to report that I've had great success getting Kafka up and running in
 AWS. I am using 3 EC2 instances, each of which is a M2 High-Memory
 Quadruple Extra Large with 8 cores and 58.4 GiB of memory according to the
 AWS specs. I have co-located Zookeeper instances next to Zafka on each
 machine.

 I am able to publish in a repeatable fashion 273,000 events per second,
 with each event payload consisting of a fixed size of 2048 bytes! This
 represents the maximum throughput possible on this configuration, as the
 servers became CPU constrained, averaging 97% utilization in a relatively
 flat line. This isn't a burst speed – it represents a sustained
 throughput from 20 M1 Large EC2 Kafka multi-threaded producers. Putting
 this into perspective, if my log retention period was a month, I'd be
 aggregating 1.3 petabytes of data on my disk drives. Suffice to say, I
 don't see us retaining data for more than a few hours!

 Here were the keys to tuning for future folks to consider:

 First and foremost, be sure to configure your Java heap size accordingly
 when you launch Kafka. The default is like 512MB, which in my case left
 virtually all of my RAM inaccessible to Kafka.
 Second, stay away from OpenJDK. No, seriously – this was a huge thorn in
 my side, and I almost gave up on Kafka because of the problems I
 encountered. The OpenJDK NIO functions repeatedly resulted in Kafka
 crashing and burning in dramatic fashion. The moment I switched over to
 Oracle's JDK for linux, Kafka didn't puke once- I mean, like not even a
 hiccup.
 Third know your message size. In my opinion, the more you understand about
 your event payload characteristics, the better you can tune the system. The
 two knobs to really turn are the log.flush.interval and
 log.default.flush.interval.ms. The values here are intrinsically
 connected to the types of payloads you are putting through the system.
 Fourth and finally, to maximize throughput you have to code against the
 async paradigm, and be prepared to tweak the batch size, queue properties,
 and compression codec (wait for it…) in a way that matches the message
 payload you are putting through the system and the capabilities of the
 producer system itself.


 Jason








Re: Partitioning and scale

2013-05-22 Thread Neha Narkhede
Not automatically as of today. You have to run the reassign-partitions tool
and explicitly move selected partitions to the new brokers. If you use this
tool, you can move partitions to the new broker without any downtime.

Thanks,
Neha


On Wed, May 22, 2013 at 2:20 PM, Timothy Chen tnac...@gmail.com wrote:

 Hi Neha/Chris,

 Thanks for the reply, so if I set a fixed number of partitions and just add
 brokers to the broker pool, does it rebalance the load to the new brokers
 (along with the data)?

 Tim


 On Wed, May 22, 2013 at 1:15 PM, Neha Narkhede neha.narkh...@gmail.com
 wrote:

  - I see that Kafka server.properties allows one to specify the number of
  partitions it supports. However, when we want to scale I wonder if we
 add #
  of partitions or # of brokers, will the same partitioner start
 distributing
  the messages to different partitions?
   And if it does, how can that same consumer continue to read off the
  messages of those ids if it was interrupted in the middle?
 
  The num.partitions config in server.properties is used only for topics
 that
  are auto created (controlled by auto.create.topics.enable). For topics
 that
  you create using the admin tool, you can specify the number of partitions
  that you want. After that, currently there is no way to change that. For
  that reason, it is a good idea to over partition your topic, which also
  helps load balance partitions onto the brokers. You are right that if you
  change the number of partitions later, then previously messages that
 stuck
  to a certain partition would now get routed to a different partition,
 which
  is undesirable for applications that want to use sticky partitioning.
 
  - I'd like to create a consumer per partition, and for each one to
  subscribe to the changes of that one. How can this be done in kafka?
 
  For your use case, it seems like SimpleConsumer might be a better fit.
  However, it will require you to write code to handle discovery of leader
  for the partition that your consumer is consuming. Chris has written up a
  great example that you can follow -
 
 
 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
 
  Thanks,
  Neha
 
 
  On Wed, May 22, 2013 at 12:37 PM, Chris Curtin curtin.ch...@gmail.com
  wrote:
 
   Hi Tim,
  
  
   On Wed, May 22, 2013 at 3:25 PM, Timothy Chen tnac...@gmail.com
 wrote:
  
Hi,
   
I'm currently trying to understand how Kafka (0.8) can scale with our
   usage
pattern and how to setup the partitioning.
   
We want to route the same messages belonging to the same id to the
 same
queue, so its consumer will able to consume all the messages of that
  id.
   
My questions:
   
 - From my understanding, in Kafka we would need to have a custom
partitioner that routes the same messages to the same partition
 right?
I'm
trying to find examples of writing this partitioner logic, but I
 can't
   find
any. Can someone point me to an example?
   
   
  https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example
  
   The partitioner here does a simple mod on the IP address and the # of
   partitions. You'd need to define your own logic, but this is a start.
  
  
- I see that Kafka server.properties allows one to specify the number
  of
partitions it supports. However, when we want to scale I wonder if we
   add #
of partitions or # of brokers, will the same partitioner start
   distributing
the messages to different partitions?
 And if it does, how can that same consumer continue to read off the
messages of those ids if it was interrupted in the middle?
   
  
   I'll let someone else answer this.
  
  
   
- I'd like to create a consumer per partition, and for each one to
subscribe to the changes of that one. How can this be done in kafka?
   
  
   Two ways: Simple Consumer or Consumer Groups:
  
   Depends on the level of control you want on code processing a specific
   partition vs. getting one assigned to it (and level of control over
  offset
   management).
  
  
 https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
  
  
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
   
  https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
 
  
  
   
Thanks,
   
Tim
   
  
 



Re: message ordering guarantees

2013-05-22 Thread Ross Black
Thanks for the explanation.

Ross



On 23 May 2013 07:19, Neha Narkhede neha.narkh...@gmail.com wrote:

 Thanks,
 Neha
 On May 21, 2013 5:42 PM, Ross Black ross.w.bl...@gmail.com wrote:

  Hi,
 
  I am using Kafka 0.7.1, and using SyncProducer and SimpleConsumer with a
  single broker service process.
 
  I am occasionally seeing messages (from a *single* partition) being
  processed out of order to what I expect and I am trying to find where the
  problem lies.  The problem may well be in my code - I just would like to
  eliminate Kafka as a potential cause.
 
  Messages are being sent sequentially from the producer process, using a
  single SyncProducer.
  Does Kafka provide any guarantees for message ordering in this case?
 
  eg.  If The sync producer sends messages A then B then C, does the Kafka
  broker guarantee that messages will be persisted with the order A,B,C?
  If not, is there any way to ensure this ordering?
  Has anything changed in 0.8 that could be used to ensure that ordering?
 
 
  Thanks,
  Ross
 



RE: Apache Kafka in AWS

2013-05-22 Thread Jason Weiss
[ec2-user@ip-10-194-5-76 ~]$ java -version
java version 1.6.0_24
OpenJDK Runtime Environment (IcedTea6 1.11.11) 
(amazon-61.1.11.11.53.amzn1-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)


Yes, as soon as I put it under heavy load, it would buckle almost consistently. 
I knew it was JDK related because I temporarily gave up on AWS, but I was able 
to run the same code on my MacBook Pro without issue. That's when I upgraded 
AWS to Oracle Java 7 64-bit and all my crashes disappeared under load.

Jason



From: Scott Clasen [sc...@heroku.com]
Sent: Wednesday, May 22, 2013 19:27
To: users
Subject: Re: Apache Kafka in AWS

Hey Jason,

 question what openjdk version did you have issues with? Im running kafka
on it now and has been ok. Was it a crash only at load?

Thanks
SC


On Wed, May 22, 2013 at 1:42 PM, Jason Weiss jason_we...@rapid7.com wrote:

 All,

 I asked a number of questions of the group over the last week, and I'm
 happy to report that I've had great success getting Kafka up and running in
 AWS. I am using 3 EC2 instances, each of which is a M2 High-Memory
 Quadruple Extra Large with 8 cores and 58.4 GiB of memory according to the
 AWS specs. I have co-located Zookeeper instances next to Zafka on each
 machine.

 I am able to publish in a repeatable fashion 273,000 events per second,
 with each event payload consisting of a fixed size of 2048 bytes! This
 represents the maximum throughput possible on this configuration, as the
 servers became CPU constrained, averaging 97% utilization in a relatively
 flat line. This isn't a burst speed – it represents a sustained
 throughput from 20 M1 Large EC2 Kafka multi-threaded producers. Putting
 this into perspective, if my log retention period was a month, I'd be
 aggregating 1.3 petabytes of data on my disk drives. Suffice to say, I
 don't see us retaining data for more than a few hours!

 Here were the keys to tuning for future folks to consider:

 First and foremost, be sure to configure your Java heap size accordingly
 when you launch Kafka. The default is like 512MB, which in my case left
 virtually all of my RAM inaccessible to Kafka.
 Second, stay away from OpenJDK. No, seriously – this was a huge thorn in
 my side, and I almost gave up on Kafka because of the problems I
 encountered. The OpenJDK NIO functions repeatedly resulted in Kafka
 crashing and burning in dramatic fashion. The moment I switched over to
 Oracle's JDK for linux, Kafka didn't puke once- I mean, like not even a
 hiccup.
 Third know your message size. In my opinion, the more you understand about
 your event payload characteristics, the better you can tune the system. The
 two knobs to really turn are the log.flush.interval and
 log.default.flush.interval.ms. The values here are intrinsically
 connected to the types of payloads you are putting through the system.
 Fourth and finally, to maximize throughput you have to code against the
 async paradigm, and be prepared to tweak the batch size, queue properties,
 and compression codec (wait for it…) in a way that matches the message
 payload you are putting through the system and the capabilities of the
 producer system itself.


 Jason









Re: Apache Kafka in AWS

2013-05-22 Thread Scott Clasen
Thanks.  FWIW  this one has been fine so far

java version 1.7.0_13
OpenJDK Runtime Environment (IcedTea7 2.3.6) (Ubuntu build 1.7.0_13-b20)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

though not running at the load in your tests.


On Wed, May 22, 2013 at 4:51 PM, Jason Weiss jason_we...@rapid7.com wrote:

 [ec2-user@ip-10-194-5-76 ~]$ java -version
 java version 1.6.0_24
 OpenJDK Runtime Environment (IcedTea6 1.11.11)
 (amazon-61.1.11.11.53.amzn1-x86_64)
 OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)


 Yes, as soon as I put it under heavy load, it would buckle almost
 consistently. I knew it was JDK related because I temporarily gave up on
 AWS, but I was able to run the same code on my MacBook Pro without issue.
 That's when I upgraded AWS to Oracle Java 7 64-bit and all my crashes
 disappeared under load.

 Jason


 
 From: Scott Clasen [sc...@heroku.com]
 Sent: Wednesday, May 22, 2013 19:27
 To: users
 Subject: Re: Apache Kafka in AWS

 Hey Jason,

  question what openjdk version did you have issues with? Im running kafka
 on it now and has been ok. Was it a crash only at load?

 Thanks
 SC


 On Wed, May 22, 2013 at 1:42 PM, Jason Weiss jason_we...@rapid7.com
 wrote:

  All,
 
  I asked a number of questions of the group over the last week, and I'm
  happy to report that I've had great success getting Kafka up and running
 in
  AWS. I am using 3 EC2 instances, each of which is a M2 High-Memory
  Quadruple Extra Large with 8 cores and 58.4 GiB of memory according to
 the
  AWS specs. I have co-located Zookeeper instances next to Zafka on each
  machine.
 
  I am able to publish in a repeatable fashion 273,000 events per second,
  with each event payload consisting of a fixed size of 2048 bytes! This
  represents the maximum throughput possible on this configuration, as the
  servers became CPU constrained, averaging 97% utilization in a relatively
  flat line. This isn't a burst speed – it represents a sustained
  throughput from 20 M1 Large EC2 Kafka multi-threaded producers. Putting
  this into perspective, if my log retention period was a month, I'd be
  aggregating 1.3 petabytes of data on my disk drives. Suffice to say, I
  don't see us retaining data for more than a few hours!
 
  Here were the keys to tuning for future folks to consider:
 
  First and foremost, be sure to configure your Java heap size accordingly
  when you launch Kafka. The default is like 512MB, which in my case left
  virtually all of my RAM inaccessible to Kafka.
  Second, stay away from OpenJDK. No, seriously – this was a huge thorn in
  my side, and I almost gave up on Kafka because of the problems I
  encountered. The OpenJDK NIO functions repeatedly resulted in Kafka
  crashing and burning in dramatic fashion. The moment I switched over to
  Oracle's JDK for linux, Kafka didn't puke once- I mean, like not even a
  hiccup.
  Third know your message size. In my opinion, the more you understand
 about
  your event payload characteristics, the better you can tune the system.
 The
  two knobs to really turn are the log.flush.interval and
  log.default.flush.interval.ms. The values here are intrinsically
  connected to the types of payloads you are putting through the system.
  Fourth and finally, to maximize throughput you have to code against the
  async paradigm, and be prepared to tweak the batch size, queue
 properties,
  and compression codec (wait for it…) in a way that matches the message
  payload you are putting through the system and the capabilities of the
  producer system itself.
 
 
  Jason
 
 
 
 
 
 




RE: Apache Kafka in AWS

2013-05-22 Thread Jason Weiss
Did you check that you were using all cores?

top was reporting over 750%

Jason


From: Ken Krugler [kkrugler_li...@transpac.com]
Sent: Wednesday, May 22, 2013 20:59
To: users@kafka.apache.org
Subject: Re: Apache Kafka in AWS

Hi Jason,

On May 22, 2013, at 3:35pm, Jason Weiss wrote:

 Ken,

 Great question! I should have indicated I was using EBS, 500GB with 2000 
 provisioned IOPs.

OK, thanks. Sounds like you were pegged on CPU usage.

But that does surprise me a bit. Did you check that you were using all cores?

Thanks,

-- Ken

PS - back in 2006 I spent a week of hell debugging an occasional job failure on 
Hadoop (this is when it was still part of Nutch). Turns out one of our 12 
slaves was accidentally using OpenJDK, and this had a JIT compiler bug that 
would occasionally rear its ugly head. Obviously the Sun/Oracle JRE isn't 
bug-free, but it gets a lot more stress testing. So one of my basic guidelines 
in the ops portion of the Hadoop class I teach is that every server must have 
exactly the same version of Oracle's JRE.

 
 From: Ken Krugler [kkrugler_li...@transpac.com]
 Sent: Wednesday, May 22, 2013 17:23
 To: users@kafka.apache.org
 Subject: Re: Apache Kafka in AWS

 Hi Jason,

 Thanks for the notes.

 I'm curious whether you went with using local drives (ephemeral storage) or 
 EBS, and if with EBS then what IOPS.

 Thanks,

 -- Ken

 On May 22, 2013, at 1:42pm, Jason Weiss wrote:

 All,

 I asked a number of questions of the group over the last week, and I'm happy 
 to report that I've had great success getting Kafka up and running in AWS. I 
 am using 3 EC2 instances, each of which is a M2 High-Memory Quadruple Extra 
 Large with 8 cores and 58.4 GiB of memory according to the AWS specs. I have 
 co-located Zookeeper instances next to Zafka on each machine.

 I am able to publish in a repeatable fashion 273,000 events per second, with 
 each event payload consisting of a fixed size of 2048 bytes! This represents 
 the maximum throughput possible on this configuration, as the servers became 
 CPU constrained, averaging 97% utilization in a relatively flat line. This 
 isn't a burst speed – it represents a sustained throughput from 20 M1 
 Large EC2 Kafka multi-threaded producers. Putting this into perspective, if 
 my log retention period was a month, I'd be aggregating 1.3 petabytes of 
 data on my disk drives. Suffice to say, I don't see us retaining data for 
 more than a few hours!

 Here were the keys to tuning for future folks to consider:

 First and foremost, be sure to configure your Java heap size accordingly 
 when you launch Kafka. The default is like 512MB, which in my case left 
 virtually all of my RAM inaccessible to Kafka.
 Second, stay away from OpenJDK. No, seriously – this was a huge thorn in my 
 side, and I almost gave up on Kafka because of the problems I encountered. 
 The OpenJDK NIO functions repeatedly resulted in Kafka crashing and burning 
 in dramatic fashion. The moment I switched over to Oracle's JDK for linux, 
 Kafka didn't puke once- I mean, like not even a hiccup.
 Third know your message size. In my opinion, the more you understand about 
 your event payload characteristics, the better you can tune the system. The 
 two knobs to really turn are the log.flush.interval and 
 log.default.flush.interval.ms. The values here are intrinsically connected 
 to the types of payloads you are putting through the system.
 Fourth and finally, to maximize throughput you have to code against the 
 async paradigm, and be prepared to tweak the batch size, queue properties, 
 and compression codec (wait for it…) in a way that matches the message 
 payload you are putting through the system and the capabilities of the 
 producer system itself.


 Jason






 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr







--
Ken Krugler
+1 530-210-6378

Re: Apache Kafka in AWS

2013-05-22 Thread Jun Rao
Jason,

Thanks for sharing. This is very interesting. Normally, Kafka brokers don't
use too much CPU. Are most of the 750% CPU actually used by Kafka brokers?
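
One way to check (the pid is a placeholder for your broker process):

  ps -eo pid,pcpu,comm --sort=-pcpu | head   # which processes are burning CPU
  top -H -p <kafka-broker-pid>               # per-thread view of the broker JVM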

Jun


On Wed, May 22, 2013 at 6:11 PM, Jason Weiss jason_we...@rapid7.com wrote:

 Did you check that you were using all cores?

 top was reporting over 750%

 Jason

 
 From: Ken Krugler [kkrugler_li...@transpac.com]
 Sent: Wednesday, May 22, 2013 20:59
 To: users@kafka.apache.org
 Subject: Re: Apache Kafka in AWS

 Hi Jason,

 On May 22, 2013, at 3:35pm, Jason Weiss wrote:

  Ken,
 
  Great question! I should have indicated I was using EBS, 500GB with 2000
 provisioned IOPs.

 OK, thanks. Sounds like you were pegged on CPU usage.

 But that does surprise me a bit. Did you check that you were using all
 cores?

 Thanks,

 -- Ken

 PS - back in 2006 I spent a week of hell debugging an occasional job failure
 on Hadoop (this is when it was still part of Nutch). Turns out one of our
 12 slaves was accidentally using OpenJDK, and this had a JIT compiler bug
 that would occasionally rear its ugly head. Obviously the Sun/Oracle JRE
 isn't bug-free, but it gets a lot more stress testing. So one of my basic
 guidelines in the ops portion of the Hadoop class I teach is that every
 server must have exactly the same version of Oracle's JRE.

  
  From: Ken Krugler [kkrugler_li...@transpac.com]
  Sent: Wednesday, May 22, 2013 17:23
  To: users@kafka.apache.org
  Subject: Re: Apache Kafka in AWS
 
  Hi Jason,
 
  Thanks for the notes.
 
  I'm curious whether you went with using local drives (ephemeral storage)
 or EBS, and if with EBS then what IOPS.
 
  Thanks,
 
  -- Ken
 
  On May 22, 2013, at 1:42pm, Jason Weiss wrote:
 
  All,
 
  I asked a number of questions of the group over the last week, and I'm
 happy to report that I've had great success getting Kafka up and running in
 AWS. I am using 3 EC2 instances, each of which is an M2 High-Memory
 Quadruple Extra Large with 8 cores and 58.4 GiB of memory according to the
 AWS specs. I have co-located Zookeeper instances next to Kafka on each
 machine.
 
  I am able to publish in a repeatable fashion 273,000 events per second,
 with each event payload consisting of a fixed size of 2048 bytes! This
 represents the maximum throughput possible on this configuration, as the
 servers became CPU constrained, averaging 97% utilization in a relatively
 flat line. This isn't a burst speed - it represents a sustained
 throughput from 20 M1 Large EC2 Kafka multi-threaded producers. Putting
 this into perspective, if my log retention period was a month, I'd be
 aggregating 1.3 petabytes of data on my disk drives. Suffice to say, I
 don't see us retaining data for more than a few hours!
 
  Here were the keys to tuning for future folks to consider:
 
  First and foremost, be sure to configure your Java heap size
 accordingly when you launch Kafka. The default is like 512MB, which in my
 case left virtually all of my RAM inaccessible to Kafka.
  Second, stay away from OpenJDK. No, seriously - this was a huge thorn
 in my side, and I almost gave up on Kafka because of the problems I
 encountered. The OpenJDK NIO functions repeatedly resulted in Kafka
 crashing and burning in dramatic fashion. The moment I switched over to
 Oracle's JDK for Linux, Kafka didn't puke once - I mean, like not even a
 hiccup.
  Third, know your message size. In my opinion, the more you understand
 about your event payload characteristics, the better you can tune the
 system. The two knobs to really turn are the log.flush.interval and
 log.default.flush.interval.ms. The values here are intrinsically
 connected to the types of payloads you are putting through the system.
  Fourth and finally, to maximize throughput you have to code against the
 async paradigm, and be prepared to tweak the batch size, queue properties,
 and compression codec (wait for it...) in a way that matches the message
 payload you are putting through the system and the capabilities of the
 producer system itself.
 
 
  Jason
 
 
 
 
 
 
  --
  Ken Krugler
  +1 530-210-6378
  http://www.scaleunlimited.com
  custom big data solutions & training
  Hadoop, Cascading, Cassandra & Solr
 
 
 
 
 

large amount of disk space freed on restart

2013-05-22 Thread Jason Rosenberg
Normally, I see 2-4 log segments deleted every hour in my brokers.  I see
log lines like this:

2013-05-23 04:40:06,857  INFO [kafka-logcleaner-0] log.LogManager -
Deleting log segment 035434043157.kafka from redacted topic

However, it seems like if I restart the broker, a massive amount of disk
space is freed (without a corresponding flood of these log segment deleted
messages).  Is there an explanation for this?  Does kafka keep reference to
file segments around, and reuse them as needed or something?  And then on
restart, the references to those free segment files are dropped?

Thoughts?

This is with 0.7.2.

Jason


Re: large amount of disk space freed on restart

2013-05-22 Thread Jonathan Creasy
It isn't uncommon: if a process has an open file handle on a file that is
deleted, the space is not freed until the handle is closed. So restarting
the process that has a handle on the file would cause the space to be freed
also.

You can troubleshoot that with lsof.
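
A quick way to confirm it (paths and pid are placeholders - point them at
your actual Kafka log directory and broker process):

  # Space pinned by deleted-but-still-open files shows up in df but not du:
  df -h /var/kafka-logs
  du -sh /var/kafka-logs

  # List open files whose on-disk link count is zero, i.e. already deleted:
  lsof +L1

  # Or inspect the broker process; unlinked files are marked "(deleted)":
  lsof -p <kafka-broker-pid> | grep -i deleted
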
Normally, I see 2-4 log segments deleted every hour in my brokers.  I see
log lines like this:

2013-05-23 04:40:06,857  INFO [kafka-logcleaner-0] log.LogManager -
Deleting log segment 035434043157.kafka from redacted topic

However, it seems like if I restart the broker, a massive amount of disk
space is freed (without a corresponding flood of these log segment deleted
messages).  Is there an explanation for this?  Does kafka keep reference to
file segments around, and reuse them as needed or something?  And then on
restart, the references to those free segment files are dropped?

Thoughts?

This is with 0.7.2.

Jason


Re: large amount of disk space freed on restart

2013-05-22 Thread Jason Rosenberg
So, does this indicate kafka (or the jvm itself) is not aggressively
closing file handles of deleted files?  Is there a fix for this?  Or is
there not likely anything to be done?  What happens if the disk fills up
with space held by these phantom deleted files?

Jason


On Wed, May 22, 2013 at 9:50 PM, Jonathan Creasy j...@box.com wrote:

 It isn't uncommon if a process has an open file handle on a file that is
 deleted, the space is not freed until the handle is closed. So restarting
 the process that has a handle on the file would cause the space to be freed
 also.

 You can troubleshoot that with lsof.
 Normally, I see 2-4 log segments deleted every hour in my brokers.  I see
 log lines like this:

 2013-05-23 04:40:06,857  INFO [kafka-logcleaner-0] log.LogManager -
 Deleting log segment 035434043157.kafka from redacted topic

 However, it seems like if I restart the broker, a massive amount of disk
 space is freed (without a corresponding flood of these log segment deleted
 messages).  Is there an explanation for this?  Does kafka keep reference to
 file segments around, and reuse them as needed or something?  And then on
 restart, the references to those free segment files are dropped?

 Thoughts?

 This is with 0.7.2.

 Jason



Re: large amount of disk space freed on restart

2013-05-22 Thread Jonathan Creasy
Well, it sounds like files were deleted while Kafka still had them open. Or
something else opened them while Kafka deleted them. I haven't noticed this
on our systems but we haven't looked for it either.

Is anything outside of Kafka deleting or reading those files?
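
One way to see exactly who has them open right now (the log directory path
is just an example):

  lsof +D /var/kafka-logs
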
On May 23, 2013 1:17 AM, Jason Rosenberg j...@squareup.com wrote:

 So, does this indicate kafka (or the jvm itself) is not aggressively
 closing file handles of deleted files?  Is there a fix for this?  Or is
 there not likely anything to be done?  What happens if the disk fills up
 with file handles for phantom deleted files?

 Jason


 On Wed, May 22, 2013 at 9:50 PM, Jonathan Creasy j...@box.com wrote:

 It isn't uncommon if a process has an open file handle on a file that is
 deleted, the space is not freed until the handle is closed. So restarting
 the process that has a handle on the file would cause the space to be
 freed
 also.

 You can troubleshoot that with lsof.
 Normally, I see 2-4 log segments deleted every hour in my brokers.  I see
 log lines like this:

 2013-05-23 04:40:06,857  INFO [kafka-logcleaner-0] log.LogManager -
 Deleting log segment 035434043157.kafka from redacted topic

 However, it seems like if I restart the broker, a massive amount of disk
 space is freed (without a corresponding flood of these log segment deleted
 messages).  Is there an explanation for this?  Does kafka keep reference
 to
 file segments around, and reuse them as needed or something?  And then on
 restart, the references to those free segment files are dropped?

 Thoughts?

 This is with 0.7.2.

 Jason





Re: large amount of disk space freed on restart

2013-05-22 Thread Jason Rosenberg
No, nothing outside of Kafka would look at those files.

I'm wondering if it's an OS-level thing too.


On Wed, May 22, 2013 at 10:25 PM, Jonathan Creasy jcre...@box.com wrote:

 Well, it sounds like files were deleted while Kafka still had them open.
 Or something else opened them while Kafka deleted them. I haven't noticed
 this on our systems but we haven't looked for it either.

 Is anything outside of Kafka deleting  or reading those files?
 On May 23, 2013 1:17 AM, Jason Rosenberg j...@squareup.com wrote:

 So, does this indicate kafka (or the jvm itself) is not aggressively
 closing file handles of deleted files?  Is there a fix for this?  Or is
 there not likely anything to be done?  What happens if the disk fills up
 with file handles for phantom deleted files?

 Jason


 On Wed, May 22, 2013 at 9:50 PM, Jonathan Creasy j...@box.com wrote:

 It isn't uncommon if a process has an open file handle on a file that is
 deleted, the space is not freed until the handle is closed. So restarting
 the process that has a handle on the file would cause the space to be
 freed
 also.

 You can troubleshoot that with lsof.
 Normally, I see 2-4 log segments deleted every hour in my brokers.  I see
 log lines like this:

 2013-05-23 04:40:06,857  INFO [kafka-logcleaner-0] log.LogManager -
 Deleting log segment 035434043157.kafka from redacted topic

 However, it seems like if I restart the broker, a massive amount of disk
 space is freed (without a corresponding flood of these log segment
 deleted
 messages).  Is there an explanation for this?  Does kafka keep reference
 to
 file segments around, and reuse them as needed or something?  And then on
 restart, the references to those free segment files are dropped?

 Thoughts?

 This is with 0.7.2.

 Jason