Order of launching Kafka servers and ZooKeepers
Hi, I want to launch Kafka on three machines. I could launch the ZooKeepers on all three machines first, and after that start the Kafka server on each machine. Or, on each machine, I could start a ZooKeeper followed by the Kafka server. I believe the first way is the right one, but I want to confirm it. Regards, Libo
Re: Order of launching Kafka servers and ZooKeepers
First launch the ZooKeeper cluster completely, then the Kafka cluster. Thanks, Neha
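A minimal sketch of that startup order, assuming the stock scripts and config files shipped with the Kafka distribution (paths and backgrounding are illustrative):

    # Step 1: start ZooKeeper on each of the three machines, and wait until
    # all three nodes are up and have formed a quorum.
    bin/zookeeper-server-start.sh config/zookeeper.properties &

    # Step 2: only then start the Kafka broker on each machine.
    bin/kafka-server-start.sh config/server.properties &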
Partitioning and scale
Hi, I'm currently trying to understand how Kafka (0.8) can scale with our usage pattern and how to set up partitioning. We want to route messages belonging to the same id to the same queue, so that its consumer will be able to consume all the messages for that id. My questions:

- From my understanding, we would need a custom partitioner in Kafka that routes messages with the same id to the same partition, right? I'm trying to find examples of writing this partitioner logic, but I can't find any. Can someone point me to an example?

- I see that Kafka's server.properties allows one to specify the number of partitions it supports. However, when we want to scale, I wonder: if we add to the number of partitions or brokers, will the same partitioner start distributing the messages to different partitions? And if it does, how can the same consumer continue to read the messages for those ids if it was interrupted in the middle?

- I'd like to create a consumer per partition, with each one subscribing to the changes of that one partition. How can this be done in Kafka?

Thanks, Tim
Re: Partitioning and scale
Hi Tim,

On Wed, May 22, 2013 at 3:25 PM, Timothy Chen tnac...@gmail.com wrote:

- From my understanding, we would need a custom partitioner in Kafka that routes messages with the same id to the same partition, right? Can someone point me to an example?

https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example

The partitioner here does a simple mod on the IP address and the number of partitions. You'd need to define your own logic, but this is a start.

- I see that Kafka's server.properties allows one to specify the number of partitions it supports. However, when we want to scale, if we add to the number of partitions or brokers, will the same partitioner start distributing the messages to different partitions? And if it does, how can the same consumer continue to read the messages for those ids if it was interrupted in the middle?

I'll let someone else answer this.

- I'd like to create a consumer per partition, and for each one to subscribe to the changes of that one. How can this be done in Kafka?

Two ways: SimpleConsumer or consumer groups. It depends on the level of control you want over which code processes a specific partition versus getting one assigned to you (and the level of control over offset management).

https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
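For a concrete sketch along the lines of that wiki example, but keying on your id instead of an IP address: the Partitioner interface and the VerifiableProperties constructor below are from the 0.8 producer API, though the exact signature varied between 0.8 builds, so treat this as illustrative rather than drop-in:

    import kafka.producer.Partitioner;
    import kafka.utils.VerifiableProperties;

    // Routes every message with the same id (used as the message key) to the
    // same partition by hashing the key modulo the partition count.
    public class IdPartitioner implements Partitioner {

        // The 0.8 producer instantiates partitioners reflectively and passes
        // in the producer config, so this constructor is required.
        public IdPartitioner(VerifiableProperties props) {}

        public int partition(Object key, int numPartitions) {
            // Mask off the sign bit so the result is always non-negative.
            // Note the caveat discussed below: if numPartitions changes,
            // the id-to-partition mapping changes with it.
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }
    }

It would be wired in on the producer config with something like props.put("partitioner.class", "example.IdPartitioner"), where the class name is of course hypothetical.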
Re: Partitioning and scale
- I see that Kafka's server.properties allows one to specify the number of partitions it supports. However, when we want to scale, if we add to the number of partitions or brokers, will the same partitioner start distributing the messages to different partitions? And if it does, how can the same consumer continue to read the messages for those ids if it was interrupted in the middle?

The num.partitions config in server.properties is used only for topics that are auto-created (controlled by auto.create.topics.enable). For topics that you create using the admin tool, you can specify the number of partitions that you want. After that, there is currently no way to change it. For that reason, it is a good idea to over-partition your topic, which also helps load-balance partitions onto the brokers. You are right that if you change the number of partitions later, messages that previously stuck to a certain partition would then get routed to a different partition, which is undesirable for applications that want to use sticky partitioning.

- I'd like to create a consumer per partition, and for each one to subscribe to the changes of that one. How can this be done in Kafka?

For your use case, it seems like SimpleConsumer might be a better fit. However, it will require you to write code to handle discovery of the leader for the partition that your consumer is consuming. Chris has written up a great example that you can follow - https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example

Thanks, Neha
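For reference, fixing the partition count up front with the admin tool looks roughly like this; the script name and flags below are from the 0.8.0 distribution and changed in later releases, and the topic name is hypothetical:

    bin/kafka-create-topic.sh --zookeeper localhost:2181 \
        --topic events --partition 8 --replica 2

Following the over-partitioning advice above would mean picking a partition count comfortably larger than the current broker count, so partitions can later be spread across added brokers.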
Apache Kafka in AWS
All, I asked a number of questions of the group over the last week, and I'm happy to report that I've had great success getting Kafka up and running in AWS.

I am using 3 EC2 instances, each of which is an M2 High-Memory Quadruple Extra Large with 8 cores and 58.4 GiB of memory according to the AWS specs. I have co-located ZooKeeper instances next to Kafka on each machine. I am able to publish, in a repeatable fashion, 273,000 events per second, with each event payload consisting of a fixed size of 2048 bytes! This represents the maximum throughput possible on this configuration, as the servers became CPU constrained, averaging 97% utilization in a relatively flat line. This isn't a burst speed – it represents sustained throughput from 20 M1 Large EC2 multi-threaded Kafka producers. Putting this into perspective, if my log retention period were a month, I'd be aggregating 1.3 petabytes of data on my disk drives. Suffice to say, I don't see us retaining data for more than a few hours!

Here were the keys to tuning, for future folks to consider:

First and foremost, be sure to configure your Java heap size accordingly when you launch Kafka. The default is something like 512MB, which in my case left virtually all of my RAM inaccessible to Kafka.

Second, stay away from OpenJDK. No, seriously – this was a huge thorn in my side, and I almost gave up on Kafka because of the problems I encountered. The OpenJDK NIO functions repeatedly resulted in Kafka crashing and burning in dramatic fashion. The moment I switched over to Oracle's JDK for Linux, Kafka didn't puke once – I mean, not even a hiccup.

Third, know your message size. In my opinion, the more you understand about your event payload characteristics, the better you can tune the system. The two knobs to really turn are log.flush.interval and log.default.flush.interval.ms. The values here are intrinsically connected to the types of payloads you are putting through the system.

Fourth and finally, to maximize throughput you have to code against the async paradigm, and be prepared to tweak the batch size, queue properties, and compression codec (wait for it…) in a way that matches the message payload you are putting through the system and the capabilities of the producer system itself.

Jason
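To make those knobs concrete, here is a hedged sketch of the settings involved: the two flush properties are the ones named in the post, the producer properties are the standard 0.7-era async producer configs, and every value is illustrative rather than a recommendation (the heap default itself lives in the start scripts, which ship with roughly -Xmx512M):

    # broker server.properties: flush tuning (0.7.x property names)
    # messages written to a log before forcing an fsync
    log.flush.interval=10000
    # maximum time in ms before a flush happens regardless of message count
    log.default.flush.interval.ms=1000

    # producer config: async mode with batching and compression (0.7.x names)
    producer.type=async
    # messages sent per async batch
    batch.size=200
    # maximum messages buffered by the async producer
    queue.size=10000
    # 0=none, 1=gzip (availability of other codecs depends on the release)
    compression.codec=1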
Re: Apache Kafka in AWS
Thanks for sharing your experience with the community, Jason! -Neha
Re: message ordering guarantees
Thanks, Neha

On May 21, 2013 5:42 PM, Ross Black ross.w.bl...@gmail.com wrote:

Hi, I am using Kafka 0.7.1, with SyncProducer and SimpleConsumer against a single broker service process. I am occasionally seeing messages (from a *single* partition) being processed out of order from what I expect, and I am trying to find where the problem lies. The problem may well be in my code - I just would like to eliminate Kafka as a potential cause. Messages are being sent sequentially from the producer process, using a single SyncProducer. Does Kafka provide any guarantees for message ordering in this case? e.g. if the sync producer sends messages A then B then C, does the Kafka broker guarantee that the messages will be persisted in the order A, B, C? If not, is there any way to ensure this ordering? Has anything changed in 0.8 that could be used to ensure this ordering? Thanks, Ross
Re: Partitioning and scale
Hi Neha/Chris, Thanks for the reply. So if I set a fixed number of partitions and just add brokers to the broker pool, does Kafka rebalance the load to the new brokers (along with the data)? Tim
Re: Apache Kafka in AWS
Hi Jason, Thanks for the notes. I'm curious whether you went with local drives (ephemeral storage) or EBS, and if EBS, then what IOPS. Thanks,

-- Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
RE: Apache Kafka in AWS
Ken, Great question! I should have indicated: I was using EBS, 500GB with 2000 provisioned IOPS. Jason
Re: Apache Kafka in AWS
Awesome write-up, Jason! Very helpful, as we are also looking to build a Kafka environment in AWS. I am curious: are you using Kafka 0.7.2 or 0.8 in your tests? Did you have just one EBS volume per broker instance, or RAID 10 across EBS volumes per broker? Thanks again for the great info! -Jonathan
RE: Apache Kafka in AWS
Jonathan, Using 0.7.2, with just a single EBS volume per broker instance - negative on the RAID 10. I would speculate that if we had used RAID 10 and gone with AWS's maximum provisioned IOPS (5000??) we probably could have squeaked out some more eps. I have no doubt, BTW, that if we had implemented this on bare metal, the numbers would have been substantially higher. For example, the variation between the 20 identical producer clients was rather dramatic - as much as 5000 eps in some cases. For identical virtualized devices, running identical software, configured identically from a single AWS AMI, the only explanation is that the performance difference came from the tax of using virtualized devices. Jason
Re: Offset in high level consumer
You can run the ConsumerOffsetChecker tool that ships with Kafka. Thanks, Neha

On Wed, May 22, 2013 at 2:02 PM, arathi maddula arathimadd...@gmail.com wrote: Hi, Could you tell me how to find the offset in a high-level Java consumer? Thanks, Arathi
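A hedged invocation via the generic class runner; tool options vary a bit between releases, but --zkconnect and --group are the core ones, and the group name here is hypothetical:

    bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
        --zkconnect localhost:2181 --group my-consumer-group

It prints, per partition, the group's committed offset, the log end offset, and the lag between them.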
Re: Apache Kafka in AWS
Hey Jason, quick question: what OpenJDK version did you have issues with? I'm running Kafka on it now and it has been OK. Was it a crash only under load? Thanks, SC
Re: Partitioning and scale
Not automatically as of today. You have to run the reassign-partitions tool and explicitly move selected partitions to the new brokers. If you use this tool, you can move partitions to the new brokers without any downtime. Thanks, Neha
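For illustration, the 0.8 tool is driven by a JSON file describing the target assignment. The flag for passing that file changed across 0.8 releases (early 0.8 used --path-to-json-file), so treat the invocation as a sketch; the topic name and broker ids are hypothetical:

    # reassign.json: move partition 0 of topic "events" onto brokers 3 and 4
    {"partitions":
       [{"topic": "events", "partition": 0, "replicas": [3, 4]}]
    }

    bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
        --path-to-json-file reassign.json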
Re: message ordering guarantees
Thanks for the explanation. Ross
RE: Apache Kafka in AWS
[ec2-user@ip-10-194-5-76 ~]$ java -version
java version 1.6.0_24
OpenJDK Runtime Environment (IcedTea6 1.11.11) (amazon-61.1.11.11.53.amzn1-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Yes, as soon as I put it under heavy load, it would buckle almost consistently. I knew it was JDK related because when I temporarily gave up on AWS, I was able to run the same code on my MacBook Pro without issue. That's when I upgraded AWS to Oracle Java 7 64-bit, and all my crashes under load disappeared. Jason
Re: Apache Kafka in AWS
Thanks. FWIW, this one has been fine so far:

java version 1.7.0_13
OpenJDK Runtime Environment (IcedTea7 2.3.6) (Ubuntu build 1.7.0_13-b20)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

though not running at the load in your tests.
RE: Apache Kafka in AWS
Did you check that you were using all cores? top was reporting over 750%. Jason

From: Ken Krugler [kkrugler_li...@transpac.com] Sent: Wednesday, May 22, 2013 20:59 To: users@kafka.apache.org Subject: Re: Apache Kafka in AWS

Hi Jason, On May 22, 2013, at 3:35pm, Jason Weiss wrote: Ken, Great question! I should have indicated: I was using EBS, 500GB with 2000 provisioned IOPS.

OK, thanks. Sounds like you were pegged on CPU usage, but that does surprise me a bit. Did you check that you were using all cores? Thanks, -- Ken

PS - back in 2006 I spent a week of hell debugging an occasional job failure on Hadoop (this was when it was still part of Nutch). It turned out one of our 12 slaves was accidentally using OpenJDK, which had a JIT compiler bug that would occasionally rear its ugly head. Obviously the Sun/Oracle JRE isn't bug-free, but it gets a lot more stress testing. So one of my basic guidelines in the ops portion of the Hadoop class I teach is that every server must have exactly the same version of Oracle's JRE.
Re: Apache Kafka in AWS
Jason, Thanks for sharing. This is very interesting. Normally, Kafka brokers don't use too much CPU. Are most of the 750% CPU actually used by Kafka brokers? Jun On Wed, May 22, 2013 at 6:11 PM, Jason Weiss jason_we...@rapid7.com wrote: Did you check that you were using all cores? top was reporting over 750% Jason From: Ken Krugler [kkrugler_li...@transpac.com] Sent: Wednesday, May 22, 2013 20:59 To: users@kafka.apache.org Subject: Re: Apache Kafka in AWS Hi Jason, On May 22, 2013, at 3:35pm, Jason Weiss wrote: Ken, Great question! I should have indicated I was using EBS, 500GB with 2000 provisioned IOPs. OK, thanks. Sounds like you were pegged on CPU usage. But that does surprise me a bit. Did you check that you were using all cores? Thanks, -- Ken PS - back in 2006 I spent a week of hell debugging an occasion job failure on Hadoop (this is when it was still part of Nutch). Turns out one of our 12 slaves was accidentally using OpenJDK, and this had a JIT compiler bug that would occasionally rear its ugly head. Obviously the Sun/Oracle JRE isn't bug-free, but it gets a lot more stress testing. So one of my basic guidelines in the ops portion of the Hadoop class I teach is that every server must have exactly the same version of Oracle's JRE. From: Ken Krugler [kkrugler_li...@transpac.com] Sent: Wednesday, May 22, 2013 17:23 To: users@kafka.apache.org Subject: Re: Apache Kafka in AWS Hi Jason, Thanks for the notes. I'm curious whether you went with using local drives (ephemeral storage) or EBS, and if with EBS then what IOPS. Thanks, -- Ken On May 22, 2013, at 1:42pm, Jason Weiss wrote: All, I asked a number of questions of the group over the last week, and I'm happy to report that I've had great success getting Kafka up and running in AWS. I am using 3 EC2 instances, each of which is a M2 High-Memory Quadruple Extra Large with 8 cores and 58.4 GiB of memory according to the AWS specs. I have co-located Zookeeper instances next to Zafka on each machine. I am able to publish in a repeatable fashion 273,000 events per second, with each event payload consisting of a fixed size of 2048 bytes! This represents the maximum throughput possible on this configuration, as the servers became CPU constrained, averaging 97% utilization in a relatively flat line. This isn't a burst speed – it represents a sustained throughput from 20 M1 Large EC2 Kafka multi-threaded producers. Putting this into perspective, if my log retention period was a month, I'd be aggregating 1.3 petabytes of data on my disk drives. Suffice to say, I don't see us retaining data for more than a few hours! Here were the keys to tuning for future folks to consider: First and foremost, be sure to configure your Java heap size accordingly when you launch Kafka. The default is like 512MB, which in my case left virtually all of my RAM inaccessible to Kafka. Second, stay away from OpenJDK. No, seriously – this was a huge thorn in my side, and I almost gave up on Kafka because of the problems I encountered. The OpenJDK NIO functions repeatedly resulted in Kafka crashing and burning in dramatic fashion. The moment I switched over to Oracle's JDK for linux, Kafka didn't puke once- I mean, like not even a hiccup. Third know your message size. In my opinion, the more you understand about your event payload characteristics, the better you can tune the system. The two knobs to really turn are the log.flush.interval and log.default.flush.interval.ms. 
The values here are intrinsically connected to the types of payloads you are putting through the system. Fourth and finally, to maximize throughput you have to code against the async paradigm, and be prepared to tweak the batch size, queue properties, and compression codec (wait for it…) in a way that matches the message payload you are putting through the system and the capabilities of the producer system itself. Jason
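The flush knobs named above are broker-side settings in server.properties. A sketch with illustrative values, assuming the 0.7-era property names (0.8 renames them log.flush.interval.messages and log.flush.interval.ms):

    # server.properties - flush tuning (0.7-era names; values are examples, not recommendations)
    log.flush.interval=10000             # force a flush after this many messages accumulate in a log
    log.default.flush.interval.ms=1000   # also flush any log not flushed within this many milliseconds

The heap default mentioned above lives in the launch scripts: 0.8-era scripts read the KAFKA_HEAP_OPTS environment variable (e.g. KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"), while older 0.7 scripts hard-code the -Xmx flag in kafka-run-class.sh.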
large amount of disk space freed on restart
Normally, I see 2-4 log segments deleted every hour in my brokers. I see log lines like this: 2013-05-23 04:40:06,857 INFO [kafka-logcleaner-0] log.LogManager - Deleting log segment 035434043157.kafka from redacted topic However, it seems like if I restart the broker, a massive amount of disk space is freed (without a corresponding flood of these log segment deleted messages). Is there an explanation for this? Does kafka keep references to file segments around, and reuse them as needed or something? And then on restart, the references to those free segment files are dropped? Thoughts? This is with 0.7.2. Jason
Re: large amount of disk space freed on restart
It isn't uncommon: if a process has an open file handle on a file that is deleted, the space is not freed until the handle is closed. So restarting the process that has a handle on the file would cause the space to be freed also. You can troubleshoot that with lsof. Normally, I see 2-4 log segments deleted every hour in my brokers. I see log lines like this: 2013-05-23 04:40:06,857 INFO [kafka-logcleaner-0] log.LogManager - Deleting log segment 035434043157.kafka from redacted topic However, it seems like if I restart the broker, a massive amount of disk space is freed (without a corresponding flood of these log segment deleted messages). Is there an explanation for this? Does kafka keep references to file segments around, and reuse them as needed or something? And then on restart, the references to those free segment files are dropped? Thoughts? This is with 0.7.2. Jason
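A quick way to check for this condition with lsof, as suggested above - the flags are standard lsof usage, though output details vary by platform, and the PID is a placeholder:

    # List open files whose on-disk link count is zero, i.e. deleted but still held open
    lsof +L1
    # Or inspect the Kafka broker process directly; Linux marks such entries "(deleted)"
    lsof -p <kafka-pid> | grep deleted

If the broker shows up here holding large segment files marked deleted, that accounts for space that is only reclaimed when the process exits.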
Re: large amount of disk space freed on restart
So, does this indicate kafka (or the jvm itself) is not aggressively closing file handles of deleted files? Is there a fix for this? Or is there not likely anything to be done? What happens if the disk fills up with file handles for phantom deleted files? Jason On Wed, May 22, 2013 at 9:50 PM, Jonathan Creasy j...@box.com wrote: It isn't uncommon: if a process has an open file handle on a file that is deleted, the space is not freed until the handle is closed. So restarting the process that has a handle on the file would cause the space to be freed also. You can troubleshoot that with lsof. Normally, I see 2-4 log segments deleted every hour in my brokers. I see log lines like this: 2013-05-23 04:40:06,857 INFO [kafka-logcleaner-0] log.LogManager - Deleting log segment 035434043157.kafka from redacted topic However, it seems like if I restart the broker, a massive amount of disk space is freed (without a corresponding flood of these log segment deleted messages). Is there an explanation for this? Does kafka keep references to file segments around, and reuse them as needed or something? And then on restart, the references to those free segment files are dropped? Thoughts? This is with 0.7.2. Jason
Re: large amount of disk space freed on restart
Well, it sounds like files were deleted while Kafka still had them open. Or something else opened them while Kafka deleted them. I haven't noticed this on our systems, but we haven't looked for it either. Is anything outside of Kafka deleting or reading those files? On May 23, 2013 1:17 AM, Jason Rosenberg j...@squareup.com wrote: So, does this indicate kafka (or the jvm itself) is not aggressively closing file handles of deleted files? Is there a fix for this? Or is there not likely anything to be done? What happens if the disk fills up with file handles for phantom deleted files? Jason On Wed, May 22, 2013 at 9:50 PM, Jonathan Creasy j...@box.com wrote: It isn't uncommon: if a process has an open file handle on a file that is deleted, the space is not freed until the handle is closed. So restarting the process that has a handle on the file would cause the space to be freed also. You can troubleshoot that with lsof. Normally, I see 2-4 log segments deleted every hour in my brokers. I see log lines like this: 2013-05-23 04:40:06,857 INFO [kafka-logcleaner-0] log.LogManager - Deleting log segment 035434043157.kafka from redacted topic However, it seems like if I restart the broker, a massive amount of disk space is freed (without a corresponding flood of these log segment deleted messages). Is there an explanation for this? Does kafka keep references to file segments around, and reuse them as needed or something? And then on restart, the references to those free segment files are dropped? Thoughts? This is with 0.7.2. Jason
Re: large amount of disk space freed on restart
No, nothing outside of kafka would look at those files. I'm wondering if it's an OS-level thing too. On Wed, May 22, 2013 at 10:25 PM, Jonathan Creasy jcre...@box.com wrote: Well, it sounds like files were deleted while Kafka still had them open. Or something else opened them while Kafka deleted them. I haven't noticed this on our systems, but we haven't looked for it either. Is anything outside of Kafka deleting or reading those files? On May 23, 2013 1:17 AM, Jason Rosenberg j...@squareup.com wrote: So, does this indicate kafka (or the jvm itself) is not aggressively closing file handles of deleted files? Is there a fix for this? Or is there not likely anything to be done? What happens if the disk fills up with file handles for phantom deleted files? Jason On Wed, May 22, 2013 at 9:50 PM, Jonathan Creasy j...@box.com wrote: It isn't uncommon: if a process has an open file handle on a file that is deleted, the space is not freed until the handle is closed. So restarting the process that has a handle on the file would cause the space to be freed also. You can troubleshoot that with lsof. Normally, I see 2-4 log segments deleted every hour in my brokers. I see log lines like this: 2013-05-23 04:40:06,857 INFO [kafka-logcleaner-0] log.LogManager - Deleting log segment 035434043157.kafka from redacted topic However, it seems like if I restart the broker, a massive amount of disk space is freed (without a corresponding flood of these log segment deleted messages). Is there an explanation for this? Does kafka keep references to file segments around, and reuse them as needed or something? And then on restart, the references to those free segment files are dropped? Thoughts? This is with 0.7.2. Jason