Kafka support
If a Fortune 500 company wants commercial support for Kafka, who would they turn to? There seems to be a natural fit with real-time processing frameworks such as Storm/Trident. I am sure that someone in the community must have come across this question. Thanks, Milind
Re: trouble loading kafka into eclipse
I don't know if anyone else has done that, or if there is any indication against doing it, but I found adding the sbteclipse plugin to project/plugins.sbt particularly easy, and it worked for me. I am only using it to read and edit the code; I am not running anything from Eclipse, though.

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.1.1")

More information here: https://github.com/typesafehub/sbteclipse/wiki

Once set up, you run "sbt update" and then "sbt eclipse", after which you can use Import existing project in Eclipse. You might still have to work through conflicts between zkclient libraries, but you can manage that manually afterward.

marc

On Fri, Apr 12, 2013 at 1:11 AM, MIS misapa...@gmail.com wrote:

Here is a brief guide to setting up Kafka in Eclipse 3.6.2 with the Scala IDE installed as a plugin. The Scala version used is 2.9.

1) Follow the instructions described here, up to step 2: https://cwiki.apache.org/KAFKA/developer-setup.html
2) Redirect the output of ./sbt update to some file and grep the file for all the jars required by the build.
3) Copy the jars mentioned as part of the build process into some folder.
4) Then follow steps 3-6 from the same link: https://cwiki.apache.org/KAFKA/developer-setup.html
5) Put the jars from step 3 in the Eclipse build path for the Kafka project, excluding the lower versions of duplicated jars. As mentioned earlier, there are some 102 jars. One more important thing: do not place zkClient-0.1.jar in the build path; instead choose the zkclient jar present in the lib folder.
6) Instead of putting scala.jar in the build path, choose the Scala jar that comes bundled with the Eclipse Scala plugin and add it as the Scala library.
7) Once the above steps are done, there won't be any further build errors, and the unit tests can be run to get started.

thanks, MIS

On Fri, Apr 5, 2013 at 9:29 AM, Jun Rao jun...@gmail.com wrote:

See if this thread helps. Thanks, Jun

On Thu, Apr 4, 2013 at 10:34 AM, Withers, Robert robert.with...@dish.com wrote:

I am struggling to load Kafka into Eclipse to get started. I have tried to follow the instructions here: https://cwiki.apache.org/KAFKA/developer-setup.html, but I cannot connect to the SVN repo to check out. A co-worker pulled from github, but I seem to be missing a lot of jars. This post mentions over a hundred jars that I should add to the build path: http://grokbase.com/t/kafka/dev/133jqejwvb/kafka-setup-in-eclipse. Furthermore, I can only get Scala 2.10 working in Juno, as the 2.9 version does not seem to install correctly (I cannot find a Scala project option with 2.9). Can anyone provide workable instructions for getting this puppy up and running? Thanks, rob
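To make Marc's recipe concrete, here is the plugin wiring as a sketch (plugin version as quoted above; file path per standard sbt layout):

    // project/plugins.sbt -- registers sbteclipse so that `sbt eclipse`
    // can generate the .project and .classpath files Eclipse needs
    addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.1.1")

After ./sbt update and ./sbt eclipse, File > Import > Existing Projects into Workspace should pick the generated project up.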
Re: Analysis of producer performance -- and Producer-Kafka reliability
This is just my opinion of course (who else's could it be? :-)) but I think from an engineering point of view, one must spend one's time making the Producer-Kafka connection solid, if it is mission-critical. Kafka is all about getting messages to disk, and assuming your disks are solid (and 0.8 has replication) those messages are safe. To then try to build a system to cope with the Kafka brokers being unavailable seems like you're setting yourself up for infinite regress. And to write code in the Producer to spool to disk seems even more pointless. If you're that worried, why not run a dedicated Kafka broker on the same node as the Producer, and connect over localhost? To turn around and write code to spool to disk, because the primary system that *spools to disk* is down, seems to be missing the point.

That said, even going over localhost, I guess the network connection could go down. In that case, Producers should buffer in RAM, and start sending some major alerts to the Operations team. But this should almost *never happen*. If it is happening regularly, *something is fundamentally wrong with your system design*. Those Producers should also refuse any more incoming traffic and await intervention. Even bringing up netcat -l and letting it suck in the data and write it to disk would work then.

Alternatives include having Producers connect to a load balancer with multiple Kafka brokers behind it, which helps you deal with any one Kafka broker failing. Or just have your Producers connect directly to multiple Kafka brokers, and switch over as needed if any one broker goes down.

I don't know if the standard Kafka producer that ships with Kafka supports buffering in RAM in an emergency. We wrote our own that does, with a focus on speed and simplicity, but I expect it will very rarely, if ever, buffer in RAM.

Building and using semi-reliable system after semi-reliable system, chaining them all together and hoping to be more tolerant of failure, is not necessarily a good approach. Instead, identifying the one system that is critical, and ensuring that it remains up (redundant installations, redundant disks, redundant network connections, etc.) is a better approach IMHO.

Philip

On Fri, Apr 12, 2013 at 7:54 AM, Jun Rao jun...@gmail.com wrote:

Another way to handle this is to provision enough client and broker servers so that the peak load can be handled without spooling. Thanks, Jun

On Thu, Apr 11, 2013 at 5:45 PM, Piotr Kozikowski pi...@liveramp.com wrote:

Jun, When talking about catastrophic consequences I was actually only referring to the producer side. In our use case (logging requests from webapp servers), a spike in traffic would force us to either tolerate a dramatic increase in response time, or drop messages, both of which are really undesirable. Hence the need to absorb spikes with some system on top of Kafka, unless the spooling feature mentioned by Wing (https://issues.apache.org/jira/browse/KAFKA-156) is implemented. This assumes there are a lot more producer machines than broker nodes, so each producer would absorb a small part of the extra load from the spike. Piotr

On Wed, Apr 10, 2013 at 10:17 PM, Jun Rao jun...@gmail.com wrote:

Piotr, Actually, could you clarify what catastrophic consequences you saw on the broker side? Do clients time out due to longer serving time, or something else? Going forward, we plan to add per-client quotas (KAFKA-656) to prevent the brokers from being overwhelmed by a runaway client.
Thanks, Jun

On Wed, Apr 10, 2013 at 12:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Hi, Is there anything one can do to defend from: "Trying to push more data than the brokers can handle for any sustained period of time has catastrophic consequences, regardless of what timeout settings are used. In our use case this means that we need to either ensure we have spare capacity for spikes, or use something on top of Kafka to absorb spikes."? Thanks, Otis
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm

From: Piotr Kozikowski pi...@liveramp.com
To: users@kafka.apache.org
Sent: Tuesday, April 9, 2013 1:23 PM
Subject: Re: Analysis of producer performance

Jun, Thank you for your comments. I'll reply point by point for clarity.

1. We were aware of the migration tool, but since we haven't used Kafka in production yet we just started using the 0.8 version directly.
2. I hadn't seen those particular slides; very interesting. I'm not sure we're testing the same thing though. In our case we vary the number of physical machines, but each one has 10 threads accessing a pool of Kafka producer objects, and in theory a single machine is
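As a concrete sketch of the "connect directly to multiple Kafka brokers" option Philip mentions above (0.8 producer property names; hostnames are hypothetical), listing several brokers lets the producer bootstrap cluster metadata from whichever one is reachable and retry on failure:

    import java.util.Properties
    import kafka.producer.ProducerConfig

    val props = new Properties()
    // Any one live broker from this list is enough to fetch metadata;
    // the producer then routes to partition leaders and refreshes
    // metadata when a send fails.
    props.put("metadata.broker.list", "kafka01:9092,kafka02:9092,kafka03:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    props.put("message.send.max.retries", "3") // retries before giving up
    val config = new ProducerConfig(props)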
Re: Analysis of producer performance -- and Producer-Kafka reliability
Interesting topic. How would buffering in RAM help in reality, though? Just trying to work through the scenario in my head: the producer tries to connect to a broker, it fails, so it appends the message to an in-memory store. If the broker is down for, say, 20 minutes and then comes back online, won't this create problems when the producer creates a new message while it still has 20 minutes of backlog, and the broker is now handling more load (assuming you are sending those in-memory messages using a different thread)?

On Fri, Apr 12, 2013 at 11:21 AM, Philip O'Toole phi...@loggly.com wrote:

This is just my opinion of course (who else's could it be? :-)) but I think from an engineering point of view, one must spend one's time making the Producer-Kafka connection solid, if it is mission-critical. [...]
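One hedged sketch of how the "buffer in RAM" idea could avoid flooding a recovering broker, which is the concern raised at the top of this thread: cap the drain rate of the backlog. Everything here (class and method names, the send callback) is invented for illustration; this is not the stock Kafka producer.

    import java.util.concurrent.ArrayBlockingQueue

    // Wraps an unreliable send function; parks messages in a bounded
    // in-memory queue when the broker is unreachable.
    class RamBufferedSender(send: String => Boolean, maxBuffered: Int = 100000) {
      private val backlog = new ArrayBlockingQueue[String](maxBuffered)

      def submit(msg: String): Boolean =
        if (send(msg)) true        // broker took it
        else backlog.offer(msg)    // broker down: buffer; false once full

      // Call periodically once the broker is back: drain at most
      // maxPerCall messages per invocation, so new traffic plus backlog
      // never exceeds what the recovering broker can absorb.
      def drain(maxPerCall: Int = 1000): Unit = {
        var sent = 0
        while (sent < maxPerCall) {
          val msg = backlog.poll()
          if (msg == null) return                        // backlog empty
          if (!send(msg)) { backlog.offer(msg); return } // still down
          sent += 1                                      // (re-offering loses ordering; fine for a sketch)
        }
      }
    }

Whether that trade-off (bounded memory, possible drops once the buffer fills, delayed delivery) is acceptable is exactly the design question being debated in this thread.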
Re: kafka key serializer
Thanks for the reply. But when I did some more research, it seems like it's using the same encoder for both. For example, if I provide serializer.class explicitly, this serializer is used for both key and value. However, if I don't specify any serializer, then it appears that Kafka defaults to DefaultEncoder. Is that what you meant? Thanks again!! Soby Chacko

On Wed, Apr 10, 2013 at 1:59 PM, Neha Narkhede neha.narkh...@gmail.com wrote:

It will use DefaultEncoder. Thanks, Neha

On Wed, Apr 10, 2013 at 8:27 AM, Soby Chacko sobycha...@gmail.com wrote:

If I don't provide an explicit key serializer, but do provide a serializer class (for value encoding), and then use a key in KeyedMessage, what encoder will be used for the key? Is it going to default to the same encoder used for the value, or to the DefaultEncoder? Thanks, Soby Chacko
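A quick sketch of the behavior described above, using the 0.8 producer properties (key.serializer.class defaults to serializer.class, and both fall back to kafka.serializer.DefaultEncoder; broker and topic names here are made up):

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092")
    // Used for values, and for keys too unless key.serializer.class is set.
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    // Uncomment to give keys their own encoder:
    // props.put("key.serializer.class", "kafka.serializer.StringEncoder")

    val producer = new Producer[String, String](new ProducerConfig(props))
    producer.send(new KeyedMessage[String, String]("test-topic", "some-key", "some-value"))
    producer.close()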
Re: Analysis of producer performance -- and Producer-Kafka reliability
"But it shouldn't almost never happen." Obviously I mean it *should* almost never happen, not "shouldn't". Philip
Broker to consumer compression
Hi, Is it possible to enable compression between the broker and the consumer? We are thinking of developing this feature for Kafka 0.7, but first I would like to check if there is something out there. Our scenario is like this:

- the producer is a CPU-bound machine, so we want to keep CPU consumption as low as possible, which means we can't enable compression there
- the consumers can fetch data from the same data center (no compression needed) or from a remote data center
- inter-site bandwidth is limited, so compression would be interesting

Our approach is to compress the connection between broker and consumer at the Kafka level, inside Kafka, so the final user can read plain data.

Regards, Pablo
Re: Broker to consumer compression
Kafka already supports end-to-end compression, which means data transfer between brokers and consumers is compressed. There are two supported compression codecs, GZIP and Snappy; the latter is lighter on CPU consumption. See this blog post for a comparison: http://geekmantra.wordpress.com/2013/03/28/compression-in-kafka-gzip-or-snappy/

Thanks, Neha

On Fri, Apr 12, 2013 at 10:56 AM, Pablo Barrera González pablo.barr...@gmail.com wrote:

Hi, Is it possible to enable compression between the broker and the consumer? [...]
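For reference, this end-to-end compression is enabled on the producer side with a single property; a minimal sketch against the 0.8 producer API (broker and topic names hypothetical; 0.7 takes an integer codec value instead):

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    props.put("compression.codec", "snappy") // or "gzip"; "none" disables
    val producer = new Producer[String, String](new ProducerConfig(props))
    producer.send(new KeyedMessage[String, String]("logs", "compressed on the wire"))
    producer.close()

The broker stores message sets in compressed form and serves them to consumers as-is, which is why the broker-to-consumer transfer is also compressed.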
Re: Broker to consumer compression
Thanks for the reply, Neha, but that is end-to-end, and I am looking for broker-to-consumer compression. So:

producer -> (uncompressed) -> broker -> (compressed) -> consumer

Regards, Pablo

2013/4/12 Neha Narkhede neha.narkh...@gmail.com:

Kafka already supports end-to-end compression, which means data transfer between brokers and consumers is compressed. [...]
Re: Broker to consumer compression
That is not available, for performance reasons. The broker uses zero-copy to transfer data from disk to the network on the consumer side. If we post-processed data already written to disk before sending it to a consumer, we would lose the performance advantage we get from zero-copy.

Thanks, Neha

On Fri, Apr 12, 2013 at 12:59 PM, Pablo Barrera González pablo.barr...@gmail.com wrote:

Thanks for the reply, Neha, but that is end-to-end, and I am looking for broker-to-consumer compression. [...]
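For the curious, the zero-copy path Neha refers to is the sendfile mechanism, exposed on the JVM as FileChannel.transferTo. A minimal sketch of the idea (file and channel names are hypothetical, not Kafka's internals):

    import java.io.RandomAccessFile
    import java.nio.channels.WritableByteChannel

    // The kernel moves bytes from the page cache straight to the socket;
    // user space never touches the data, so there is no chance to
    // recompress (or otherwise transform) it on the way out.
    def sendSegment(path: String, dest: WritableByteChannel): Long = {
      val ch = new RandomAccessFile(path, "r").getChannel
      try ch.transferTo(0, ch.size(), dest)
      finally ch.close()
    }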
Re: Not balancing across multiple brokers
Do you use a VIP or ZooKeeper for producer-side load balancing? In other words, what values do you override for broker.list and zk.connect in the producer config?

Thanks, Neha

On Fri, Apr 12, 2013 at 12:16 PM, Tom Brown tombrow...@gmail.com wrote:

We have recently set up a new Kafka (0.7.1) cluster with two brokers. Each topic has 2 partitions per server. We have two processes that write to the cluster using the class kafka.javaapi.producer.Producer. The problem is that the first process only writes to the first broker. The second process (using the exact same code to perform the write) successfully writes to both brokers. How can I identify the cause of the imbalance in the first process? How does the Producer decide which broker is the recipient of each message? Thanks! --Tom
Re: Not balancing across multiple brokers
In the producer config, we use the zk connect string: zk001,zk002,zk003/kafka. Both brokers have registered themselves with ZooKeeper. Because only the first broker has ever received any writes, only the first broker is registered for the topic in question.

--Tom

On Fri, Apr 12, 2013 at 3:32 PM, Neha Narkhede neha.narkh...@gmail.com wrote:

Do you use a VIP or ZooKeeper for producer-side load balancing? [...]
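For context, a sketch of the two 0.7-era producer discovery modes (hostnames and ports are hypothetical). With zk.connect the producer discovers brokers and topic partitions from ZooKeeper; broker.list is the static alternative:

    import java.util.Properties
    import kafka.javaapi.producer.{Producer, ProducerData}
    import kafka.producer.ProducerConfig

    val props = new Properties()
    // ZooKeeper-based discovery, as Tom uses:
    props.put("zk.connect", "zk001:2181,zk002:2181,zk003:2181/kafka")
    // Static alternative (mutually exclusive with zk.connect); the 0.7
    // format is brokerId:host:port:
    // props.put("broker.list", "1:kafka01:9092,2:kafka02:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val producer = new Producer[String, String](new ProducerConfig(props))
    producer.send(new ProducerData[String, String]("test-topic", "hello"))
    producer.close()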
Re: Analysis of producer performance
Hi all, I posted an update on the post (https://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/) to test the effect of disabling ack messages from brokers. It appears this makes a big difference (~2x improvement) when using synthetic log messages, but only a modest 12% improvement when using real production messages. This is using GZIP compression. The way I interpret this is that just turning acks off is not enough to mimic the 0.7 behavior: GZIP consumes significant CPU time, and since the brokers now need to decompress data, there is a hit on throughput even without acks. Does this sound reasonable?

Thanks, Piotr

On Mon, Apr 8, 2013 at 4:42 PM, Piotr Kozikowski pi...@liveramp.com wrote:

Hi, At LiveRamp we are considering replacing Scribe with Kafka, and as a first step we ran some tests to evaluate producer performance. You can find our preliminary results here: https://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/. We hope this will be useful for some folks, and if anyone has comments or suggestions about what to do differently to obtain better results, your feedback will be very welcome. Thanks, Piotr
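For anyone reproducing the test, disabling acks in the 0.8 producer is a one-property change; a hedged sketch (broker address hypothetical):

    import java.util.Properties
    import kafka.producer.ProducerConfig

    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    props.put("compression.codec", "gzip")  // as in the benchmark above
    // 0 = fire-and-forget (closest to 0.7), 1 = leader ack, -1 = full ISR ack
    props.put("request.required.acks", "0")
    val config = new ProducerConfig(props)

Note that even with acks=0, an 0.8 broker still decompresses and recompresses GZIP batches to assign offsets to the inner messages, which matches Piotr's interpretation of the remaining throughput gap.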
Re: trouble loading kafka into eclipse
Thanks, Jun

On Fri, Apr 12, 2013 at 1:08 PM, Marc Labbe mrla...@gmail.com wrote:

I updated the Developer Setup page. Let me know if it's not clear enough or if I need to change anything. On another note, since the idea plugin is already there, would it be possible to add the sbteclipse plugin permanently as well?

On Fri, Apr 12, 2013 at 10:52 AM, Jun Rao jun...@gmail.com wrote:

MIS, Marc, Thanks for the update. Could you put those notes on that wiki? Jun

On Thu, Apr 11, 2013 at 10:11 PM, MIS misapa...@gmail.com wrote:

Here is a brief guide to setting up Kafka in Eclipse 3.6.2 with the Scala IDE installed as a plugin. [...]