It seems there are two underlying things here: storing messages to stable storage, and making messages available to consumers (i.e., storing messages on the broker). The first can be achieved simply and reliably by spooling to local disk; the second requires the network and is inherently less reliable. Buffering messages in memory does not help with the first, since memory is volatile storage, but it does help with the second in the event of a network partition.

I could imagine a producer running in an "ultra-reliability" mode where it uses a local log file as a buffer that all messages are written to and read from. One issue with this, though, is that now you have to worry about the performance and capacity of the disks on your producers (which can be numerous compared to brokers). As for performance, the data being written by producers is already in active memory, so writing it to disk and then doing a zero-copy transfer to the network should be pretty fast (maybe?).
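To make that concrete, here is a rough sketch (purely hypothetical, nothing that exists in Kafka) of what such a local log-file buffer might look like on the producer side. The class name and record framing are made up, the sketch assumes a fresh spool file, and rotation/recovery/cleanup are left out:

import java.io.*;
import java.nio.file.*;

// Hypothetical sketch of an "ultra-reliability" producer-side buffer: every
// outgoing message is appended to a local log file, and a separate sender
// thread reads it back and ships it to the brokers.
public class DiskSpool implements Closeable {
    private final DataOutputStream out;
    private final DataInputStream in;
    private long appended = 0;   // bytes durably written to the spool
    private long consumed = 0;   // bytes handed to the sender thread

    public DiskSpool(Path file) throws IOException {
        Files.deleteIfExists(file);   // sketch assumes a fresh spool; real recovery is omitted
        out = new DataOutputStream(new BufferedOutputStream(
                Files.newOutputStream(file, StandardOpenOption.CREATE,
                                      StandardOpenOption.APPEND)));
        in = new DataInputStream(new BufferedInputStream(Files.newInputStream(file)));
    }

    // Request path: durability depends only on the local disk.
    public synchronized void append(byte[] message) throws IOException {
        out.writeInt(message.length);
        out.write(message);
        out.flush();   // flushing (or fsync-ing) less often trades safety for speed
        appended += 4 + message.length;
    }

    // Sender thread: next spooled message, or null if we have caught up with the writer.
    public synchronized byte[] poll() throws IOException {
        if (consumed >= appended) return null;
        byte[] msg = new byte[in.readInt()];
        in.readFully(msg);
        consumed += 4 + msg.length;
        return msg;
    }

    @Override public void close() throws IOException {
        out.close();
        in.close();
    }
}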

Or, Kafka can remain more "protocol-ish" and less "application-y" and just give you errors when brokers are unavailable, and let your application deal with it. This is basically what TCP/HTTP/etc. do. HTTP servers don't say "hold on, there's a problem, let me try that request again in a second..."

Interesting discussion, btw :)
-David

On 4/15/13 2:18 PM, Piotr Kozikowski wrote:
Philip,

We would not use spooling to local disk on the producer to deal with
problems with the connection to the brokers, but rather to absorb temporary
spikes in traffic that would overwhelm the brokers. This is assuming that
1) those spikes are relatively short, but when they come they require much
higher throughput than normal (otherwise we'd just have a capacity problem
and would need more brokers), and 2) the spikes are long enough for just a
RAM buffer to be dangerous. If the brokers did go down, spooling to disk
would give us more time to react, but that's not the primary reason for
wanting the feature.
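
To put rough numbers on point 2 (all figures below are made up for illustration, not from our tests), a quick back-of-the-envelope calculation of how long each kind of buffer survives a spike that exceeds broker capacity:

// Hypothetical numbers only: how long a buffer absorbs a spike that exceeds
// broker capacity by a fixed excess rate.
public class BufferHeadroom {
    public static void main(String[] args) {
        double excessMBps  = 50.0;           // spike throughput above broker capacity, per producer
        double ramBufferMB = 1024.0;         // 1 GB in-memory buffer
        double diskSpoolMB = 100 * 1024.0;   // 100 GB local disk spool

        System.out.printf("RAM buffer lasts ~%.0f seconds%n", ramBufferMB / excessMBps);
        System.out.printf("Disk spool lasts ~%.0f seconds (~%.0f minutes)%n",
                diskSpoolMB / excessMBps, diskSpoolMB / excessMBps / 60);
    }
}

// With these made-up numbers: the RAM buffer lasts ~20 seconds, while the disk
// spool lasts ~2048 seconds (~34 minutes).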

-Piotr

On Fri, Apr 12, 2013 at 8:21 AM, Philip O'Toole <phi...@loggly.com> wrote:

This is just my opinion of course (who else's could it be? :-)) but I think
from an engineering point of view, one must spend one's time making the
Producer-Kafka connection solid, if it is mission-critical.

Kafka is all about getting messages to disk, and assuming your disks are
solid (and 0.8 has replication) those messages are safe. To then try to
build a system to cope with the Kafka brokers being unavailable seems like
you're setting yourself up for infinite regress. And to write code in the
Producer to spool to disk seems even more pointless. If you're that
worried, why not run a dedicated Kafka broker on the same node as the
Producer, and connect over localhost? To turn around and write code to
spool to disk, because the primary system that *spools to disk* is down
seems to be missing the point.

That said, even when going over localhost, I guess the network connection
could go down. In that case, Producers should buffer in RAM, and start
sending some major alerts to the Operations team. But this should almost
*never happen*. If it is happening regularly *something is fundamentally
wrong with your system design*. Those Producers should also refuse any more
incoming traffic and await intervention. Even bringing up "netcat -l" and
letting it suck in the data and write it to disk would work then.
Alternatives include having Producers connect to a load-balancer with
multiple Kafka brokers behind it, which helps you deal with any one Kafka
broker failing. Or just have your Producers connect directly to multiple
Kafka brokers, and switch over as needed if any one broker goes down.

I don't know if the standard producer that ships with Kafka supports
buffering in RAM in an emergency. We wrote our own that does, with a focus
on speed and simplicity, but I expect it will very rarely, if ever, buffer
in RAM.
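
For what it's worth, here is a minimal sketch of the behaviour described above (buffer in RAM up to a hard cap, alert Operations, and refuse new traffic once the cap is hit). The names are made up, and the alerting call is a stand-in for whatever paging system is in use:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: buffer in RAM up to a hard cap, raise an alert and
// refuse new messages once the cap is hit, and let a sender thread drain the
// buffer when the brokers are reachable again.
public class BoundedRamBuffer {
    private final BlockingQueue<byte[]> buffer;
    private volatile boolean alertRaised = false;

    public BoundedRamBuffer(int maxBufferedMessages) {
        this.buffer = new ArrayBlockingQueue<byte[]>(maxBufferedMessages);
    }

    // Returns false (and alerts once) when full: the caller should refuse any
    // more incoming traffic and await intervention.
    public boolean offer(byte[] message) {
        boolean accepted = buffer.offer(message);
        if (!accepted && !alertRaised) {
            alertRaised = true;
            alertOperations("producer RAM buffer full, refusing new traffic");
        }
        return accepted;
    }

    // Drained by the sender thread once the broker connection recovers.
    public byte[] poll() {
        byte[] message = buffer.poll();
        if (message != null && buffer.isEmpty()) {
            alertRaised = false;   // buffer drained; re-arm the alert
        }
        return message;
    }

    private void alertOperations(String text) {
        // Stand-in for whatever paging/alerting system Operations uses.
        System.err.println("ALERT: " + text);
    }
}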

Building and using semi-reliable system after semi-reliable system, and
chaining them all together, hoping to be more tolerant of failure is not
necessarily a good approach. Instead, identifying that one system that is
critical, and ensuring that it remains up (redundant installations,
redundant disks, redundant network connections, etc.) is a better approach
IMHO.

Philip


On Fri, Apr 12, 2013 at 7:54 AM, Jun Rao <jun...@gmail.com> wrote:

Another way to handle this is to provision enough client and broker servers
so that the peak load can be handled without spooling.

Thanks,

Jun


On Thu, Apr 11, 2013 at 5:45 PM, Piotr Kozikowski <pi...@liveramp.com> wrote:
Jun,

When talking about "catastrophic consequences" I was actually only referring
to the producer side. In our use case (logging requests from webapp servers),
a spike in traffic would force us to either tolerate a dramatic increase in
the response time, or drop messages, both of which are really undesirable.
Hence the need to absorb spikes with some system on top of Kafka, unless the
spooling feature mentioned by Wing
(https://issues.apache.org/jira/browse/KAFKA-156) is implemented. This is
assuming there are a lot more producer machines than broker nodes, so each
producer would absorb a small part of the extra load from the spike.

Piotr

On Wed, Apr 10, 2013 at 10:17 PM, Jun Rao <jun...@gmail.com> wrote:

Piotr,

Actually, could you clarify what "catastrophic consequences" you saw on the
broker side? Do clients time out due to longer serving time, or is it
something else?

Going forward, we plan to add per-client quotas (KAFKA-656) to prevent the
brokers from being overwhelmed by a runaway client.
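
Until broker-side quotas land, a client could also cap its own produce rate. A minimal token-bucket sketch (hypothetical, not a Kafka API) that an application might wrap around its send path:

// Hypothetical client-side throttle (not a Kafka API): acquire() blocks until
// sending the given number of bytes stays within bytesPerSecond.
public class ProduceThrottle {
    private final double bytesPerSecond;
    private double available;        // bytes currently allowed to be sent
    private long lastRefillNanos;

    public ProduceThrottle(double bytesPerSecond) {
        this.bytesPerSecond = bytesPerSecond;
        this.available = bytesPerSecond;          // allow up to one second of burst
        this.lastRefillNanos = System.nanoTime();
    }

    public synchronized void acquire(long bytes) throws InterruptedException {
        if (bytes > bytesPerSecond) {
            throw new IllegalArgumentException("batch larger than one second of quota");
        }
        while (true) {
            long now = System.nanoTime();
            available = Math.min(bytesPerSecond,
                    available + (now - lastRefillNanos) / 1e9 * bytesPerSecond);
            lastRefillNanos = now;
            if (available >= bytes) {
                available -= bytes;
                return;
            }
            // Timed wait used purely as a sleep; the loop handles spurious wakeups.
            long waitMs = (long) Math.ceil((bytes - available) / bytesPerSecond * 1000);
            wait(Math.max(1, waitMs));
        }
    }
}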

Thanks,

Jun


On Wed, Apr 10, 2013 at 12:04 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

Hi,

Is there anything one can do to "defend" from:

"Trying to push more data than the brokers can handle for any
sustained
period of time has catastrophic consequences, regardless of what
timeout
settings are used. In our use case this means that we need to
either
ensure
we have spare capacity for spikes, or use something on top of Kafka
to
absorb spikes."

?
Thanks,
Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase -
http://sematext.com/spm


________________________________
From: Piotr Kozikowski <pi...@liveramp.com>
To: users@kafka.apache.org
Sent: Tuesday, April 9, 2013 1:23 PM
Subject: Re: Analysis of producer performance

Jun,

Thank you for your comments. I'll reply point by point for clarity.

1. We were aware of the migration tool but since we haven't used Kafka for
production yet we just started using the 0.8 version directly.

2. I hadn't seen those particular slides, very interesting. I'm not sure
we're testing the same thing though. In our case we vary the number of
physical machines, but each one has 10 threads accessing a pool of Kafka
producer objects, and in theory a single machine is enough to saturate the
brokers (which our test mostly confirms); a rough sketch of such a pool is
included at the end of this list. Also, assuming that the slides are based
on the built-in producer performance tool, I know that we started getting
very different numbers once we switched to using "real" (actual production
log) messages. Compression may also be a factor in case it wasn't configured
the same way in those tests.

3. In the latency section, there are two tests, one for average and another
for maximum latency. Each one has two graphs presenting the exact same data
but at different levels of zoom. The first one is to observe small
variations of latency when target throughput <= actual throughput. The
second is to observe the overall shape of the graph once latency starts
growing when target throughput > actual throughput. I hope that makes sense.

4. That sounds great, looking forward to it.
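
(Referenced from point 2 above.) A rough sketch of what that producer pool looks like; the type parameter stands in for whichever Kafka producer class is actually used, and the names and counts are only illustrative:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Rough sketch of the setup from point 2: worker threads share a small pool
// of producer objects instead of each creating its own. The type parameter P
// stands in for whichever Kafka producer class is actually used.
public class ProducerPool<P> {
    private final BlockingQueue<P> pool = new LinkedBlockingQueue<P>();

    public ProducerPool(Iterable<P> producers) {
        for (P producer : producers) {
            pool.offer(producer);
        }
    }

    /** Borrow a producer; blocks if all of them are currently in use. */
    public P borrow() throws InterruptedException {
        return pool.take();
    }

    /** Return a borrowed producer so other threads can use it. */
    public void release(P producer) {
        pool.offer(producer);
    }
}

Each of the 10 worker threads does borrow() / send / release() in a try/finally around its send call, so producers are shared rather than created per thread.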

Piotr

On Mon, Apr 8, 2013 at 9:48 PM, Jun Rao <jun...@gmail.com> wrote:

Piotr,

Thanks for sharing this. Very interesting and useful study. A few comments:

1. For existing 0.7 users, we have a migration tool that mirrors data from an
0.7 cluster to an 0.8 cluster. Applications can upgrade to 0.8 by upgrading
consumers first, followed by producers.

2. Have you looked at the Kafka ApacheCon slides
(http://www.slideshare.net/junrao/kafka-replication-apachecon2013)? Towards
the end, there are some performance numbers too. The figure for throughput
vs. #producers is different from what you have. Not sure if this is because
you have turned on compression.

3. Not sure that I understand the difference between the first 2 graphs in
the latency section. What's different between the 2 tests?

4. Post 0.8, we plan to improve the producer-side throughput by implementing
non-blocking sockets on the client side.

Jun


On Mon, Apr 8, 2013 at 4:42 PM, Piotr Kozikowski <pi...@liveramp.com> wrote:

Hi,

At LiveRamp we are considering replacing Scribe with Kafka, and as a first
step we ran some tests to evaluate producer performance. You can find our
preliminary results here:
https://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/. We
hope this will be useful for some folks, and if anyone has comments or
suggestions about what to do differently to obtain better results, your
feedback will be very welcome.

Thanks,

Piotr



