Kafka support

2013-04-12 Thread Milind Parikh
If an F500 company wants commercial support for Kafka, who would they turn
to?
There seems to be a natural fit with real-time processing
schemes such as Storm/Trident.

I am sure that someone in the community must have come across this issue.
Thanks
Milind


Re: trouble loading kafka into eclipse

2013-04-12 Thread Marc Labbe
I don't know if anyone else has done that or if there is any recommendation
against doing it, but I found adding the sbteclipse plugin in
project/plugins.sbt to be particularly easy, and it worked for me. I am
only using it to look at and edit the code; I am not running anything from
Eclipse though.

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.1.1")

More information here: https://github.com/typesafehub/sbteclipse/wiki

Once set up, you run
sbt update
sbt eclipse

and then you can use "Import existing project" in Eclipse.

You might still have to work around conflicts between zkclient libraries, but
you can manage those manually afterward.

marc


On Fri, Apr 12, 2013 at 1:11 AM, MIS misapa...@gmail.com wrote:

 here is a brief overview of setting up Kafka in Eclipse 3.6.2 with the Scala
 IDE installed as a plugin. The Scala version used is 2.9.

 1) Follow the instructions described here:
 https://cwiki.apache.org/KAFKA/developer-setup.html, up to step 2.
 2) Redirect the output of ./sbt update to some file and grep the file for
 all the jars that are required in the build process (a small snippet for
 this follows the list).
 3) Copy the jars that are mentioned as part of the build process into some
 folder.
 4) Then follow steps 3-6 from the link:
 https://cwiki.apache.org/KAFKA/developer-setup.html
 5) Put the jars from step 3 in the build path of Eclipse for the Kafka
 project, but do not include the lower versions of the jars. As mentioned
 earlier, there are some 102 jars. One more important thing is not to
 place zkClient-0.1.jar in the build path but rather to choose the zkclient
 jar that is present in the lib folder.
 6) Instead of putting scala.jar in the build path, choose the Scala jar
 that comes bundled with Eclipse as the Scala plugin and add that as the
 Scala library.
 7) Once the above steps are done, there won't be any further build errors,
 and the unit tests can be run to get started.
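
 (For step 2, a throwaway Scala snippet along these lines can pull the jar
 paths out of the redirected output; the log file name is whatever you
 redirected to.)

 import scala.io.Source

 // Collect every *.jar path mentioned anywhere in the sbt update output.
 val jars = Source.fromFile("sbt-update.log").getLines()
   .flatMap("""\S+\.jar""".r.findAllIn(_))
   .toSet
 jars.toSeq.sorted.foreach(println)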

 thanks,
 MIS







 On Fri, Apr 5, 2013 at 9:29 AM, Jun Rao jun...@gmail.com wrote:

  See if this thread help.
 
  Thanks,
 
  Jun
 
 
  On Thu, Apr 4, 2013 at 10:34 AM, Withers, Robert 
 robert.with...@dish.com
  wrote:
 
   I am struggling to load kafka into eclipse to get started.  I have
 tried
   to follow the instructions here:
   https://cwiki.apache.org/KAFKA/developer-setup.html, but I cannot
  connect
   to the SVN repo to check out. A co-worker pulled from GitHub, but I
 seem
   to be missing a lot of jars.  This post mentions over a hundred jars
  that I
   should add to the build path:
   http://grokbase.com/t/kafka/dev/133jqejwvb/kafka-setup-in-eclipse.
Furthermore, I can only get scala 2.10 working in Juno, as the 2.9
  version
   does not seem to install correctly (I cannot find a scala project
 option
   with 2.9).
  
   Can anyone provide workable instructions for getting this puppy up and
   running?
  
   Thanks,
   rob
  
 



Re: Analysis of producer performance -- and Producer-Kafka reliability

2013-04-12 Thread Philip O'Toole
This is just my opinion of course (who else's could it be? :-)) but I think
from an engineering point of view, one must spend one's time making the
Producer-Kafka connection solid, if it is mission-critical.

Kafka is all about getting messages to disk, and assuming your disks are
solid (and 0.8 has replication) those messages are safe. To then try to
build a system to cope with the Kafka brokers being unavailable seems like
you're setting yourself up for infinite regress. And to write code in the
Producer to spool to disk seems even more pointless. If you're that
worried, why not run a dedicated Kafka broker on the same node as the
Producer, and connect over localhost? To turn around and write code to
spool to disk, because the primary system that *spools to disk* is down
seems to be missing the point.

That said, even going over localhost, I guess the network connection
could go down. In that case, Producers should buffer in RAM, and start
sending some major alerts to the Operations team. But this should almost
*never happen*. If it is happening regularly *something is fundamentally
wrong with your system design*. Those Producers should also refuse any more
incoming traffic and await intervention. Even bringing up netcat -l and
letting it suck in the data and write it to disk would work then.
Alternatives include having Producers connect to a load-balancer with
multiple Kafka brokers behind it, which helps you deal with any one Kafka
broker failing. Or just have your Producers connect directly to multiple
Kafka brokers, and switch over as needed if any one broker goes down.
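
For illustration, a minimal sketch of that last option against the 0.8 Scala
producer API (host names are made up; the point is just that more than one
broker is listed, so the client is not tied to a single endpoint):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

val props = new Properties()
// List several brokers so the producer can fail over if any one goes down.
props.put("metadata.broker.list", "kafka01:9092,kafka02:9092,kafka03:9092")
props.put("serializer.class", "kafka.serializer.StringEncoder")
props.put("request.required.acks", "1") // wait for the leader to ack each send

val producer = new Producer[String, String](new ProducerConfig(props))
producer.send(new KeyedMessage[String, String]("events", "hello from the producer"))
producer.close()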

I don't know if the standard Kafka producer that ships with Kafka supports
buffering in RAM in an emergency. We wrote our own that does, with a focus
on speed and simplicity, but I expect it will very rarely, if ever, buffer
in RAM.

Building and using semi-reliable system after semi-reliable system, and
chaining them all together, hoping to be more tolerant of failure is not
necessarily a good approach. Instead, identifying that one system that is
critical, and ensuring that it remains up (redundant installations,
redundant disks, redundant network connections etc) is a better approach
IMHO.

Philip


On Fri, Apr 12, 2013 at 7:54 AM, Jun Rao jun...@gmail.com wrote:

 Another way to handle this is to provision enough client and broker servers
 so that the peak load can be handled without spooling.

 Thanks,

 Jun


 On Thu, Apr 11, 2013 at 5:45 PM, Piotr Kozikowski pi...@liveramp.com
 wrote:

  Jun,
 
  When talking about catastrophic consequences I was actually only
  referring to the producer side. In our use case (logging requests from
  webapp servers), a spike in traffic would force us to either tolerate a
  dramatic increase in the response time, or drop messages, both of which
 are
  really undesirable. Hence the need to absorb spikes with some system on
 top
  of Kafka, unless the spooling feature mentioned by Wing (
  https://issues.apache.org/jira/browse/KAFKA-156) is implemented. This is
  assuming there are a lot more producer machines than broker nodes, so
 each
  producer would absorb a small part of the extra load from the spike.
 
  Piotr
 
  On Wed, Apr 10, 2013 at 10:17 PM, Jun Rao jun...@gmail.com wrote:
 
   Piotr,
  
   Actually, could you clarify what catastrophic consequences did you
 see
  on
   the broker side? Do clients timeout due to longer serving time or
  something
   else?
  
   Going forward, we plan to add per client quotas (KAFKA-656) to prevent
  the
   brokers from being overwhelmed by a runaway client.
  
   Thanks,
  
   Jun
  
  
   On Wed, Apr 10, 2013 at 12:04 PM, Otis Gospodnetic 
   otis_gospodne...@yahoo.com wrote:
  
Hi,
   
Is there anything one can do to defend from:
   
Trying to push more data than the brokers can handle for any
 sustained
period of time has catastrophic consequences, regardless of what
  timeout
settings are used. In our use case this means that we need to either
   ensure
we have spare capacity for spikes, or use something on top of Kafka
 to
absorb spikes.
   
?
Thanks,
Otis

Performance Monitoring for Solr / ElasticSearch / HBase -
http://sematext.com/spm
   
   
   
   
   

 From: Piotr Kozikowski pi...@liveramp.com
To: users@kafka.apache.org
Sent: Tuesday, April 9, 2013 1:23 PM
Subject: Re: Analysis of producer performance

Jun,

Thank you for your comments. I'll reply point by point for clarity.

1. We were aware of the migration tool but since we haven't used
 Kafka
   for
production yet we just started using the 0.8 version directly.

2. I hadn't seen those particular slides, very interesting. I'm not
  sure
we're testing the same thing though. In our case we vary the number
 of
physical machines, but each one has 10 threads accessing a pool of
  Kafka
producer objects and in theory a single machine is 

Re: Analysis of producer performance -- and Producer-Kafka reliability

2013-04-12 Thread S Ahmed
Interesting topic.

How would buffering in RAM help in reality though (just trying to work
through the scenario in my head):

The producer tries to connect to a broker and fails, so it appends the message
to an in-memory store. If the broker is down for, say, 20 minutes and then
comes back online, won't this create problems when the producer creates
a new message and it has 20 minutes of backlog, and the broker is now
handling more load (assuming you are sending those in-memory messages using
a different thread)?




On Fri, Apr 12, 2013 at 11:21 AM, Philip O'Toole phi...@loggly.com wrote:

 This is just my opinion of course (who else's could it be? :-)) but I think
 from an engineering point of view, one must spend one's time making the
 Producer-Kafka connection solid, if it is mission-critical.

 Kafka is all about getting messages to disk, and assuming your disks are
 solid (and 0.8 has replication) those messages are safe. To then try to
 build a system to cope with the Kafka brokers being unavailable seems like
 you're setting yourself up for infinite regress. And to write code in the
 Producer to spool to disk seems even more pointless. If you're that
 worried, why not run a dedicated Kafka broker on the same node as the
 Producer, and connect over localhost? To turn around and write code to
 spool to disk, because the primary system that *spools to disk* is down
 seems to be missing the point.

 That said, even going over localhost, I guess the network connection
 could go down. In that case, Producers should buffer in RAM, and start
 sending some major alerts to the Operations team. But this should almost
 *never happen*. If it is happening regularly *something is fundamentally
 wrong with your system design*. Those Producers should also refuse any more
 incoming traffic and await intervention. Even bringing up netcat -l and
 letting it suck in the data and write it to disk would work then.
 Alternatives include having Producers connect to a load-balancer with
 multiple Kafka brokers behind it, which helps you deal with any one Kafka
 broker failing. Or just have your Producers connect directly to multiple
 Kafka brokers, and switch over as needed if any one broker goes down.

 I don't know if the standard Kafka producer that ships with Kafka supports
 buffering in RAM in an emergency. We wrote our own that does, with a focus
 on speed and simplicity, but I expect it will very rarely, if ever, buffer
 in RAM.

 Building and using semi-reliable system after semi-reliable system, and
 chaining them all together, hoping to be more tolerant of failure is not
 necessarily a good approach. Instead, identifying that one system that is
 critical, and ensuring that it remains up (redundant installations,
 redundant disks, redundant network connections etc) is a better approach
 IMHO.

 Philip


 On Fri, Apr 12, 2013 at 7:54 AM, Jun Rao jun...@gmail.com wrote:

  Another way to handle this is to provision enough client and broker
 servers
  so that the peak load can be handled without spooling.
 
  Thanks,
 
  Jun
 
 
  On Thu, Apr 11, 2013 at 5:45 PM, Piotr Kozikowski pi...@liveramp.com
  wrote:
 
   Jun,
  
   When talking about catastrophic consequences I was actually only
   referring to the producer side. In our use case (logging requests from
   webapp servers), a spike in traffic would force us to either tolerate a
   dramatic increase in the response time, or drop messages, both of which
  are
   really undesirable. Hence the need to absorb spikes with some system on
  top
   of Kafka, unless the spooling feature mentioned by Wing (
   https://issues.apache.org/jira/browse/KAFKA-156) is implemented. This
 is
   assuming there are a lot more producer machines than broker nodes, so
  each
   producer would absorb a small part of the extra load from the spike.
  
   Piotr
  
   On Wed, Apr 10, 2013 at 10:17 PM, Jun Rao jun...@gmail.com wrote:
  
Piotr,
   
Actually, could you clarify what catastrophic consequences did you
  see
   on
the broker side? Do clients timeout due to longer serving time or
   something
else?
   
Going forward, we plan to add per client quotas (KAFKA-656) to
 prevent
   the
brokers from being overwhelmed by a runaway client.
   
Thanks,
   
Jun
   
   
On Wed, Apr 10, 2013 at 12:04 PM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:
   
 Hi,

 Is there anything one can do to defend from:

 Trying to push more data than the brokers can handle for any
  sustained
 period of time has catastrophic consequences, regardless of what
   timeout
 settings are used. In our use case this means that we need to
 either
ensure
 we have spare capacity for spikes, or use something on top of Kafka
  to
 absorb spikes.

 ?
 Thanks,
 Otis
 
 Performance Monitoring for Solr / ElasticSearch / HBase -
 http://sematext.com/spm





 

Re: kafka key serializer

2013-04-12 Thread Soby Chacko
Thanks for the reply. But when I did some more research, it seems like it's
using the same encoder for both. For example, if I provide serializer.class
explicitly, this serializer is used for both key and value. However, if I
don't specify any serializer, then it appears that Kafka defaults to
DefaultEncoder. Is that what you meant?
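
For example, a minimal sketch of the two properties in play, against the 0.8
producer (assuming key.serializer.class is the relevant override; broker
address and topic are placeholders):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

val props = new Properties()
props.put("metadata.broker.list", "localhost:9092")
// Used for the value, and apparently for the key too when no key serializer is set.
props.put("serializer.class", "kafka.serializer.StringEncoder")
// Uncomment to give the key its own encoder instead of falling back:
// props.put("key.serializer.class", "kafka.serializer.StringEncoder")

val producer = new Producer[String, String](new ProducerConfig(props))
producer.send(new KeyedMessage[String, String]("some-topic", "some-key", "some-value"))
producer.close()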

Thanks again!!
Soby Chacko


On Wed, Apr 10, 2013 at 1:59 PM, Neha Narkhede neha.narkh...@gmail.comwrote:

 It will use DefaultEncoder.

 Thanks,
 Neha

 On Wed, Apr 10, 2013 at 8:27 AM, Soby Chacko sobycha...@gmail.com wrote:
  If I don't provide an explicit key serializer but a serializer class (for
  value encoding), and then use a key in KeyedMessage, what will be the
  encoder used for the key? Is it going to default to the same encoder used for
  the value, or the DefaultEncoder?
 
  Thanks,
  Soby Chacko



Re: Analysis of producer performance -- and Producer-Kafka reliability

2013-04-12 Thread Philip O'Toole
But it shouldn't almost never happen.

Obviously I mean it should almost never happen. Not shouldn't.

Philip


Broker to consumer compression

2013-04-12 Thread Pablo Barrera González
Hi

Is it possible to enable compression between the broker and the consumer?

We are thinking of developing this feature in Kafka 0.7, but first I would
like to check if there is something out there.

Our scenario is like this:

- the producer is a CPU-bound machine, so we want to keep the CPU
consumption as low as possible, so we can't enable compression here
- the consumers can fetch data from the same data center (no
compression needed) or from a remote data center
- inter-site bandwidth is limited, so compression would be interesting

Our approach is to compress the connection between broker and consumer at
the Kafka level, inside Kafka, so the end user can read plain data.

Regards

Pablo


Re: Broker to consumer compression

2013-04-12 Thread Neha Narkhede
Kafka already supports end-to-end compression which means data
transfer between brokers and consumers is compressed. There are two
supported compression codecs - GZIP and Snappy. The latter is lighter
on CPU consumption. See this blog post for comparison -
http://geekmantra.wordpress.com/2013/03/28/compression-in-kafka-gzip-or-snappy/
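
For reference, a rough sketch of turning this on from the producer side (a
sketch against the 0.8 Scala producer; the broker address and topic name are
placeholders):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

val props = new Properties()
props.put("metadata.broker.list", "localhost:9092")
props.put("serializer.class", "kafka.serializer.StringEncoder")
// "gzip" or "snappy": the batch stays compressed on the broker and over the
// wire to the consumer, which decompresses it.
props.put("compression.codec", "snappy")

val producer = new Producer[String, String](new ProducerConfig(props))
producer.send(new KeyedMessage[String, String]("some-topic", "payload"))
producer.close()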

Thanks,
Neha

On Fri, Apr 12, 2013 at 10:56 AM, Pablo Barrera González
pablo.barr...@gmail.com wrote:
 Hi

 Is it possible to enable compression between the broker and the consumer?

 We are thinking of developing this feature in Kafka 0.7, but first I would
 like to check if there is something out there.

 Our scenario is like this:

 - the producer is a CPU-bound machine, so we want to keep the CPU
 consumption as low as possible, so we can't enable compression here
 - the consumers can fetch data from the same data center (no
 compression needed) or from a remote data center
 - inter-site bandwidth is limited, so compression would be interesting

 Our approach is to compress the connection between broker and consumer at
 the Kafka level, inside Kafka, so the end user can read plain data.

 Regards

 Pablo


Re: Broker to consumer compression

2013-04-12 Thread Pablo Barrera González
Thanks for the reply, Neha, but that is end-to-end, and I am looking
for broker-to-consumer compression.

So:

Producer -> (uncompressed) -> broker -> (compressed) -> consumer

Regards

Pablo


2013/4/12 Neha Narkhede neha.narkh...@gmail.com:
 Kafka already supports end-to-end compression which means data
 transfer between brokers and consumers is compressed. There are two
 supported compression codecs - GZIP and Snappy. The latter is lighter
 on CPU consumption. See this blog post for comparison -
 http://geekmantra.wordpress.com/2013/03/28/compression-in-kafka-gzip-or-snappy/

 Thanks,
 Neha

 On Fri, Apr 12, 2013 at 10:56 AM, Pablo Barrera González
 pablo.barr...@gmail.com wrote:
 Hi

 Is it possible to enable compression between the broker and the consumer?

 We are thinking of developing this feature in Kafka 0.7, but first I would
 like to check if there is something out there.

 Our scenario is like this:

 - the producer is a CPU-bound machine, so we want to keep the CPU
 consumption as low as possible, so we can't enable compression here
 - the consumers can fetch data from the same data center (no
 compression needed) or from a remote data center
 - inter-site bandwidth is limited, so compression would be interesting

 Our approach is to compress the connection between broker and consumer at
 the Kafka level, inside Kafka, so the end user can read plain data.

 Regards

 Pablo


Re: Broker to consumer compression

2013-04-12 Thread Neha Narkhede
That is not available, for performance reasons. The broker uses zero-copy
to transfer data from disk to the network on the consumer side. If we
post-process data already written to disk before sending it to the
consumer, we will lose the performance advantage that we get from
zero-copy.
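
To illustrate the mechanism (this is not Kafka's code, just the JDK primitive
the broker relies on; file and host names are invented): FileChannel.transferTo
hands bytes straight from the page cache to the socket, and re-encoding the
data on the way out would forfeit exactly that path.

import java.io.RandomAccessFile
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

val log = new RandomAccessFile("/tmp/kafka-logs/topic-0/00000000000000000000.kafka", "r").getChannel
val socket = SocketChannel.open(new InetSocketAddress("consumer-host", 12345))
// Bytes move from the file to the socket without passing through user space,
// so there is no point at which the broker could cheaply recompress them.
var position = 0L
while (position < log.size()) {
  position += log.transferTo(position, log.size() - position, socket)
}
log.close()
socket.close()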

Thanks,
Neha

On Fri, Apr 12, 2013 at 12:59 PM, Pablo Barrera González
pablo.barr...@gmail.com wrote:
 Thanks for the reply, Neha, but that is end-to-end, and I am looking
 for broker-to-consumer compression.

 So:

 Producer -> (uncompressed) -> broker -> (compressed) -> consumer

 Regards

 Pablo


 2013/4/12 Neha Narkhede neha.narkh...@gmail.com:
 Kafka already supports end-to-end compression which means data
 transfer between brokers and consumers is compressed. There are two
 supported compression codecs - GZIP and Snappy. The latter is lighter
 on CPU consumption. See this blog post for comparison -
 http://geekmantra.wordpress.com/2013/03/28/compression-in-kafka-gzip-or-snappy/

 Thanks,
 Neha

 On Fri, Apr 12, 2013 at 10:56 AM, Pablo Barrera González
 pablo.barr...@gmail.com wrote:
 Hi

 Is it possible to enable compression between the broker and the consumer?

 We are thinking of developing this feature in Kafka 0.7, but first I would
 like to check if there is something out there.

 Our scenario is like this:

 - the producer is a CPU-bound machine, so we want to keep the CPU
 consumption as low as possible, so we can't enable compression here
 - the consumers can fetch data from the same data center (no
 compression needed) or from a remote data center
 - inter-site bandwidth is limited, so compression would be interesting

 Our approach is to compress the connection between broker and consumer at
 the Kafka level, inside Kafka, so the end user can read plain data.

 Regards

 Pablo


Re: Not balancing across multiple brokers

2013-04-12 Thread Neha Narkhede
Do you use a VIP or zookeeper for producer-side load balancing? In
other words, what are the values you override for broker.list and
zk.connect in the producer config?

Thanks,
Neha

On Fri, Apr 12, 2013 at 12:16 PM, Tom Brown tombrow...@gmail.com wrote:
 We have recently set up a new Kafka (0.7.1) cluster with two brokers. Each
 topic has 2 partitions per server. We have two processes that write
 to the cluster using the class kafka.javaapi.producer.Producer.

 The problem is that the first process only writes to the first broker. The
 second process (using the exact same code to perform the write)
 successfully writes to both brokers.

 How can I identify the cause of the imbalance in the first process? How
 does the Producer decide which broker is the recipient of each message?

 Thanks!

 --Tom


Re: Not balancing across multiple brokers

2013-04-12 Thread Tom Brown
In the producer config, we use the zk connect string:
zk001,zk002,zk003/kafka.
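
(For concreteness, a sketch of a 0.7 producer wired that way. Everything except
the zk.connect value is an assumption, and the commented broker.list line is
the static alternative Neha mentions.)

import java.util.Properties
import kafka.javaapi.producer.{Producer, ProducerData}
import kafka.producer.ProducerConfig

val props = new Properties()
// ZooKeeper-based discovery: the producer learns brokers and partitions from ZK.
props.put("zk.connect", "zk001:2181,zk002:2181,zk003:2181/kafka")
// Static list instead (0.7 format is brokerId:host:port):
// props.put("broker.list", "0:kafka01:9092,1:kafka02:9092")
props.put("serializer.class", "kafka.serializer.StringEncoder")

val producer = new Producer[String, String](new ProducerConfig(props))
producer.send(new ProducerData[String, String]("my-topic", "payload"))
producer.close()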

Both brokers have registered themselves with zookeeper. Because only the
first broker has ever received any writes, only the first broker is
registered for the topic in question.

--Tom


On Fri, Apr 12, 2013 at 3:32 PM, Neha Narkhede neha.narkh...@gmail.comwrote:

 Do you use a VIP or zookeeper for producer-side load balancing? In
 other words, what are the values you override for broker.list and
 zk.connect in the producer config?

 Thanks,
 Neha

 On Fri, Apr 12, 2013 at 12:16 PM, Tom Brown tombrow...@gmail.com wrote:
  We have recently set up a new Kafka (0.7.1) cluster with two brokers. Each
  topic has 2 partitions per server. We have two processes that write
  to the cluster using the class kafka.javaapi.producer.Producer.
 
  The problem is that the first process only writes to the first broker.
 The
  second process (using the exact same code to perform the write)
  successfully writes to both brokers.
 
  How can I identify the cause of the imbalance in the first process? How
  does the Producer decide which broker is the recipient of each message?
 
  Thanks!
 
  --Tom



Re: Analysis of producer performance

2013-04-12 Thread Piotr Kozikowski
Hi all,

I posted an update on the post (
https://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/) to
test the effect of disabling ack messages from brokers. It appears this
makes a big difference (~2x improvement) only when using synthetic log
messages, with just a modest 12% improvement when using real production
messages. This is using GZIP compression. The way I interpret this is that
just turning acks off is not enough to mimic the 0.7 behavior: GZIP
consumes significant CPU time, and since the brokers now need to decompress
data, there is a hit on throughput even without acks. Does this sound
reasonable?
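
For anyone reproducing this, the two knobs being compared are roughly the
following (0.8 producer properties; broker address and topic are placeholders,
and the values shown are the "acks off" plus GZIP case):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

val props = new Properties()
props.put("metadata.broker.list", "localhost:9092")
props.put("serializer.class", "kafka.serializer.StringEncoder")
// 0 = don't wait for any acknowledgement; 1 = wait for the leader;
// -1 = wait for all in-sync replicas.
props.put("request.required.acks", "0")
// With GZIP the brokers still need to decompress the data, which costs CPU
// even when acks are off.
props.put("compression.codec", "gzip")

val producer = new Producer[String, String](new ProducerConfig(props))
producer.send(new KeyedMessage[String, String]("perf-test", "synthetic log line"))
producer.close()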

Thanks,

Piotr

On Mon, Apr 8, 2013 at 4:42 PM, Piotr Kozikowski pi...@liveramp.com wrote:

 Hi,

 At LiveRamp we are considering replacing Scribe with Kafka, and as a first
 step we ran some tests to evaluate producer performance. You can find our
 preliminary results here:
 https://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/.
 We hope this will be useful for some folks, and if anyone has comments or
 suggestions about what to do differently to obtain better results, your
 feedback will be very welcome.

 Thanks,

 Piotr



Re: trouble loading kafka into eclipse

2013-04-12 Thread Jun Rao
Thanks,

Jun


On Fri, Apr 12, 2013 at 1:08 PM, Marc Labbe mrla...@gmail.com wrote:

 I updated the Developer setup page. Let me know if it's not clear enough or
 if I need to change anything.

 On another note, since the idea plugin is already there, would it be
 possible to add the sbteclipse plugin permanently as well?


 On Fri, Apr 12, 2013 at 10:52 AM, Jun Rao jun...@gmail.com wrote:

  MIS, Marc,
 
  Thanks for the update. Could you put those notes on that wiki?
 
  Jun
 
 
  On Thu, Apr 11, 2013 at 10:11 PM, MIS misapa...@gmail.com wrote:
 
    here is a brief overview of setting up Kafka in Eclipse 3.6.2 with the
    Scala IDE installed as a plugin. The Scala version used is 2.9.

    1) Follow the instructions described here:
    https://cwiki.apache.org/KAFKA/developer-setup.html, up to step 2.
    2) Redirect the output of ./sbt update to some file and grep the file
    for all the jars that are required in the build process.
    3) Copy the jars that are mentioned as part of the build process into
    some folder.
    4) Then follow steps 3-6 from the link:
    https://cwiki.apache.org/KAFKA/developer-setup.html
    5) Put the jars from step 3 in the build path of Eclipse for the Kafka
    project, but do not include the lower versions of the jars. As mentioned
    earlier, there are some 102 jars. One more important thing is not to
    place zkClient-0.1.jar in the build path but rather to choose the
    zkclient jar that is present in the lib folder.
    6) Instead of putting scala.jar in the build path, choose the Scala jar
    that comes bundled with Eclipse as the Scala plugin and add that as the
    Scala library.
    7) Once the above steps are done, there won't be any further build
    errors, and the unit tests can be run to get started.
  
   thanks,
   MIS
  
  
  
  
  
  
  
   On Fri, Apr 5, 2013 at 9:29 AM, Jun Rao jun...@gmail.com wrote:
  
See if this thread help.
   
Thanks,
   
Jun
   
   
On Thu, Apr 4, 2013 at 10:34 AM, Withers, Robert 
   robert.with...@dish.com
wrote:
   
 I am struggling to load kafka into eclipse to get started.  I have
   tried
 to follow the instructions here:
 https://cwiki.apache.org/KAFKA/developer-setup.html, but I cannot
connect
  to the SVN repo to check out. A co-worker pulled from GitHub, but
 I
   seem
 to be missing a lot of jars.  This post mentions over a hundred
 jars
that I
 should add to the build path:
 http://grokbase.com/t/kafka/dev/133jqejwvb/kafka-setup-in-eclipse.
  Furthermore, I can only get scala 2.10 working in Juno, as the 2.9
version
 does not seem to install correctly (I cannot find a scala project
   option
 with 2.9).

 Can anyone provide workable instructions for getting this puppy up
  and
 running?

 Thanks,
 rob