It seems there are two underlying things here: storing messages to stable storage, and making messages available to consumers (i.e., storing messages on the broker). The first can be achieved simply and reliably by spooling to local disk; the second requires the network and is inherently less reliable. Buffering messages in memory does not help with the first, since memory is volatile storage, but it does help with the second in the event of a network partition.

I could imagine a producer running in an "ultra-reliability" mode where it uses a local log file as a buffer that all messages are written to and read from. One issue with this, though, is that now you have to worry about the performance and capacity of the disks on your producers (which can be numerous compared to brokers). As for performance, the data being written by producers is already in active memory, so writing it to disk and then doing a zero-copy transfer to the network should be pretty fast (maybe?).
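To make that concrete, here is a rough sketch (purely hypothetical, nothing that exists in Kafka) of what such a local log-file buffer might look like on the producer side. The class name and record framing are made up, the sketch assumes a fresh spool file, and rotation/recovery/cleanup are left out:

import java.io.*;
import java.nio.file.*;

// Hypothetical sketch of an "ultra-reliability" producer-side buffer: every
// outgoing message is appended to a local log file, and a separate sender
// thread reads it back and ships it to the brokers.
public class DiskSpool implements Closeable {
    private final DataOutputStream out;
    private final DataInputStream in;
    private long appended = 0;   // bytes durably written to the spool
    private long consumed = 0;   // bytes handed to the sender thread

    public DiskSpool(Path file) throws IOException {
        Files.deleteIfExists(file);   // sketch assumes a fresh spool; real recovery is omitted
        out = new DataOutputStream(new BufferedOutputStream(
                Files.newOutputStream(file, StandardOpenOption.CREATE,
                                      StandardOpenOption.APPEND)));
        in = new DataInputStream(new BufferedInputStream(Files.newInputStream(file)));
    }

    // Request path: durability depends only on the local disk.
    public synchronized void append(byte[] message) throws IOException {
        out.writeInt(message.length);
        out.write(message);
        out.flush();   // flushing (or fsync-ing) less often trades safety for speed
        appended += 4 + message.length;
    }

    // Sender thread: next spooled message, or null if we have caught up with the writer.
    public synchronized byte[] poll() throws IOException {
        if (consumed >= appended) return null;
        byte[] msg = new byte[in.readInt()];
        in.readFully(msg);
        consumed += 4 + msg.length;
        return msg;
    }

    @Override public void close() throws IOException {
        out.close();
        in.close();
    }
}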

Or, Kafka can remain more "protocol-ish" and less "application-y" and just give you errors when brokers are unavailable, and let your application deal with it. This is basically what TCP/HTTP/etc. do. HTTP servers don't say "hold on, there's a problem, let me try that request again in a second..."

Interesting discussion, btw :)
-David

On 4/15/13 2:18 PM, Piotr Kozikowski wrote:
Philip,

We would not use spooling to local disk on the producer to deal with
problems with the connection to the brokers, but rather to absorb temporary
spikes in traffic that would overwhelm the brokers. This is assuming that
1) those spikes are relatively short, but when they come they require much
higher throughput than normal (otherwise we'd just have a capacity problem
and would need more brokers), and 2) the spikes are long enough for just a
RAM buffer to be dangerous. If the brokers did go down, spooling to disk
would give us more time to react, but that's not the primary reason for
wanting the feature.
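
To put rough numbers on point 2 (all figures below are made up for illustration, not from our tests), a quick back-of-the-envelope calculation of how long each kind of buffer survives a spike that exceeds broker capacity:

// Hypothetical numbers only: how long a buffer absorbs a spike that exceeds
// broker capacity by a fixed excess rate.
public class BufferHeadroom {
    public static void main(String[] args) {
        double excessMBps  = 50.0;           // spike throughput above broker capacity, per producer
        double ramBufferMB = 1024.0;         // 1 GB in-memory buffer
        double diskSpoolMB = 100 * 1024.0;   // 100 GB local disk spool

        System.out.printf("RAM buffer lasts ~%.0f seconds%n", ramBufferMB / excessMBps);
        System.out.printf("Disk spool lasts ~%.0f seconds (~%.0f minutes)%n",
                diskSpoolMB / excessMBps, diskSpoolMB / excessMBps / 60);
    }
}

// With these made-up numbers: the RAM buffer lasts ~20 seconds, while the disk
// spool lasts ~2048 seconds (~34 minutes).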

-Piotr

On Fri, Apr 12, 2013 at 8:21 AM, Philip O'Toole <phi...@loggly.com> wrote:

This is just my opinion of course (who else's could it be? :-)) but I think
from an engineering point of view, one must spend one's time making the
Producer-Kafka connection solid, if it is mission-critical.

Kafka is all about getting messages to disk, and assuming your disks are
solid (and 0.8 has replication) those messages are safe. To then try to
build a system to cope with the Kafka brokers being unavailable seems like
you're setting yourself up for infinite regress. And to write code in the
Producer to spool to disk seems even more pointless. If you're that
worried, why not run a dedicated Kafka broker on the same node as the
Producer, and connect over localhost? To turn around and write code to
spool to disk, because the primary system that *spools to disk* is down
seems to be missing the point.

That said, even when going over localhost, I guess the network connection
could go down. In that case, Producers should buffer in RAM, and start
sending some major alerts to the Operations team. But this should almost
*never happen*. If it is happening regularly *something is fundamentally
wrong with your system design*. Those Producers should also refuse any more
incoming traffic and await intervention. Even bringing up "netcat -l" and
letting it suck in the data and write it to disk would work then.
Alternatives include having Producers connect to a load-balancer with
multiple Kafka brokers behind it, which helps you deal with any one Kafka
broker failing. Or just have your Producers connect directly to multiple
Kafka brokers, and switch over as needed if any one broker goes down.

I don't know if the standard producer that ships with Kafka supports
buffering in RAM in an emergency. We wrote our own that does, with a focus
on speed and simplicity, but I expect it will very rarely, if ever, buffer
in RAM.
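
For what it's worth, here is a minimal sketch of the behaviour described above (buffer in RAM up to a hard cap, alert Operations, and refuse new traffic once the cap is hit). The names are made up, and the alerting call is a stand-in for whatever paging system is in use:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: buffer in RAM up to a hard cap, raise an alert and
// refuse new messages once the cap is hit, and let a sender thread drain the
// buffer when the brokers are reachable again.
public class BoundedRamBuffer {
    private final BlockingQueue<byte[]> buffer;
    private volatile boolean alertRaised = false;

    public BoundedRamBuffer(int maxBufferedMessages) {
        this.buffer = new ArrayBlockingQueue<byte[]>(maxBufferedMessages);
    }

    // Returns false (and alerts once) when full: the caller should refuse any
    // more incoming traffic and await intervention.
    public boolean offer(byte[] message) {
        boolean accepted = buffer.offer(message);
        if (!accepted && !alertRaised) {
            alertRaised = true;
            alertOperations("producer RAM buffer full, refusing new traffic");
        }
        return accepted;
    }

    // Drained by the sender thread once the broker connection recovers.
    public byte[] poll() {
        byte[] message = buffer.poll();
        if (message != null && buffer.isEmpty()) {
            alertRaised = false;   // buffer drained; re-arm the alert
        }
        return message;
    }

    private void alertOperations(String text) {
        // Stand-in for whatever paging/alerting system Operations uses.
        System.err.println("ALERT: " + text);
    }
}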

Building and using semi-reliable system after semi-reliable system, and
chaining them all together, hoping to be more tolerant of failure is not
necessarily a good approach. Instead, identifying that one system that is
critical, and ensuring that it remains up (redundant installations,
redundant disks, redundant network connections, etc.) is a better approach
IMHO.

Philip


On Fri, Apr 12, 2013 at 7:54 AM, Jun Rao <jun...@gmail.com> wrote:

Another way to handle this is to provision enough client and broker servers
so that the peak load can be handled without spooling.

Thanks,

Jun


On Thu, Apr 11, 2013 at 5:45 PM, Piotr Kozikowski <pi...@liveramp.com> wrote:
Jun,

When talking about "catastrophic consequences" I was actually only referring
to the producer side. In our use case (logging requests from webapp servers),
a spike in traffic would force us to either tolerate a dramatic increase in
the response time, or drop messages, both of which are really undesirable.
Hence the need to absorb spikes with some system on top of Kafka, unless the
spooling feature mentioned by Wing
(https://issues.apache.org/jira/browse/KAFKA-156) is implemented. This is
assuming there are a lot more producer machines than broker nodes, so each
producer would absorb a small part of the extra load from the spike.

Piotr

On Wed, Apr 10, 2013 at 10:17 PM, Jun Rao <jun...@gmail.com> wrote:

Piotr,

Actually, could you clarify what "catastrophic consequences" you saw on the
broker side? Do clients time out due to longer serving time, or is it
something else?

Going forward, we plan to add per-client quotas (KAFKA-656) to prevent the
brokers from being overwhelmed by a runaway client.
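
Until broker-side quotas land, a client could also cap its own produce rate. A minimal token-bucket sketch (hypothetical, not a Kafka API) that an application might wrap around its send path:

// Hypothetical client-side throttle (not a Kafka API): acquire() blocks until
// sending the given number of bytes stays within bytesPerSecond.
public class ProduceThrottle {
    private final double bytesPerSecond;
    private double available;        // bytes currently allowed to be sent
    private long lastRefillNanos;

    public ProduceThrottle(double bytesPerSecond) {
        this.bytesPerSecond = bytesPerSecond;
        this.available = bytesPerSecond;          // allow up to one second of burst
        this.lastRefillNanos = System.nanoTime();
    }

    public synchronized void acquire(long bytes) throws InterruptedException {
        if (bytes > bytesPerSecond) {
            throw new IllegalArgumentException("batch larger than one second of quota");
        }
        while (true) {
            long now = System.nanoTime();
            available = Math.min(bytesPerSecond,
                    available + (now - lastRefillNanos) / 1e9 * bytesPerSecond);
            lastRefillNanos = now;
            if (available >= bytes) {
                available -= bytes;
                return;
            }
            // Timed wait used purely as a sleep; the loop handles spurious wakeups.
            long waitMs = (long) Math.ceil((bytes - available) / bytesPerSecond * 1000);
            wait(Math.max(1, waitMs));
        }
    }
}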

Thanks,

Jun


On Wed, Apr 10, 2013 at 12:04 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

Hi,

Is there anything one can do to "defend" from:

"Trying to push more data than the brokers can handle for any
sustained
period of time has catastrophic consequences, regardless of what
timeout
settings are used. In our use case this means that we need to
either
ensure
we have spare capacity for spikes, or use something on top of Kafka
to
absorb spikes."

?
Thanks,
Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase -
http://sematext.com/spm


________________________________
From: Piotr Kozikowski <pi...@liveramp.com>
To: users@kafka.apache.org
Sent: Tuesday, April 9, 2013 1:23 PM
Subject: Re: Analysis of producer performance

Jun,

Thank you for your comments. I'll reply point by point for clarity.

1. We were aware of the migration tool but since we haven't used Kafka for
production yet we just started using the 0.8 version directly.

2. I hadn't seen those particular slides, very interesting. I'm not sure
we're testing the same thing though. In our case we vary the number of
physical machines, but each one has 10 threads accessing a pool of Kafka
producer objects, and in theory a single machine is enough to saturate the
brokers (which our test mostly confirms); a rough sketch of such a pool is
included at the end of this list. Also, assuming that the slides are based
on the built-in producer performance tool, I know that we started getting
very different numbers once we switched to using "real" (actual production
log) messages. Compression may also be a factor in case it wasn't configured
the same way in those tests.

3. In the latency section, there are two tests, one for average and another
for maximum latency. Each one has two graphs presenting the exact same data
but at different levels of zoom. The first one is to observe small
variations of latency when target throughput <= actual throughput. The
second is to observe the overall shape of the graph once latency starts
growing when target throughput > actual throughput. I hope that makes sense.

4. That sounds great, looking forward to it.
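
(Referenced from point 2 above.) A rough sketch of what that producer pool looks like; the type parameter stands in for whichever Kafka producer class is actually used, and the names and counts are only illustrative:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Rough sketch of the setup from point 2: worker threads share a small pool
// of producer objects instead of each creating its own. The type parameter P
// stands in for whichever Kafka producer class is actually used.
public class ProducerPool<P> {
    private final BlockingQueue<P> pool = new LinkedBlockingQueue<P>();

    public ProducerPool(Iterable<P> producers) {
        for (P producer : producers) {
            pool.offer(producer);
        }
    }

    /** Borrow a producer; blocks if all of them are currently in use. */
    public P borrow() throws InterruptedException {
        return pool.take();
    }

    /** Return a borrowed producer so other threads can use it. */
    public void release(P producer) {
        pool.offer(producer);
    }
}

Each of the 10 worker threads does borrow() / send / release() in a try/finally around its send call, so producers are shared rather than created per thread.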

Piotr

On Mon, Apr 8, 2013 at 9:48 PM, Jun Rao <jun...@gmail.com> wrote:

Piotr,

Thanks for sharing this. Very interesting and useful study. A few comments:

1. For existing 0.7 users, we have a migration tool that mirrors data from an
0.7 cluster to an 0.8 cluster. Applications can upgrade to 0.8 by upgrading
consumers first, followed by producers.

2. Have you looked at the Kafka ApacheCon slides
(http://www.slideshare.net/junrao/kafka-replication-apachecon2013)? Towards
the end, there are some performance numbers too. The figure for throughput
vs. #producers is different from what you have. Not sure if this is because
you have turned on compression.

3. Not sure that I understand the difference between the first 2 graphs in
the latency section. What's different between the 2 tests?

4. Post 0.8, we plan to improve the producer-side throughput by implementing
non-blocking sockets on the client side.

Jun


On Mon, Apr 8, 2013 at 4:42 PM, Piotr Kozikowski <pi...@liveramp.com> wrote:

Hi,

At LiveRamp we are considering replacing Scribe with Kafka, and as a first
step we ran some tests to evaluate producer performance. You can find our
preliminary results here:
https://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/. We
hope this will be useful for some folks, and if anyone has comments or
suggestions about what to do differently to obtain better results, your
feedback will be very welcome.

Thanks,

Piotr



