Paul,

The only concern is that if we expose unflushed messages, those messages could disappear after a broker machine restart.

Jun
On Tue, Jul 19, 2011 at 2:02 PM, Paul Sutter <[email protected]> wrote:

> One more suggestion:
>
> Even before you have replication, it seems that you could delay producer-side acks until after the data is recorded to disk, and still pass the data forward to consumers immediately.
>
> On Jul 19, 2011, at 12:23 PM, Jay Kreps <[email protected]> wrote:
>
> > Agreed, no reason the policy to hand out messages should not be configurable. We were hoping to make the whole question irrelevant with the replication, since then the producer can choose the replication level it wants and fsync durability should be less of a concern.
> >
> > I agree with your comment that a good implementation of streaming with acks could potentially be superior.
> >
> > -Jay
> >
> > On Tue, Jul 19, 2011 at 11:15 AM, Paul Sutter <[email protected]> wrote:
> >
> >> Jay,
> >>
> >> Ah - thanks for the clarification on the delay in the broker. It would be nice if that were a configuration option, so that the end user can choose either to forward only messages that have been written to disk, or to have the data forwarded immediately. When you implement replication, data hitting the disk will matter less.
> >>
> >> On the delay in the producer, I think it could best be resolved through measurement. In your paper you compare two different approaches, and I'm proposing a third:
> >>
> >> 1. Send and wait (single message, JMS style)
> >> 2. Batch, send, and wait (Kafka today)
> >> 3. Stream with ACKs
> >>
> >> Removing any wait for a reply should increase throughput, not decrease it, so you're likely trading latency against potential CPU efficiency. And the CPU savings is a question best resolved by measurement.
> >>
> >> I'd also encourage you to think about the WAN case. When you send-and-wait, you need to send a buffer that is much larger than the bandwidth-delay product to approach full line utilization, and the line will go idle for one RTT while you stop to wait for a reply. The bandwidth*delay product can get large (tens of megabytes), and end users will rarely understand the need to tune the batch size to increase throughput. They'll just say it's slow over long distances.
> >>
> >> All that said - your use case doesn't require minimizing latency or WAN use, so I can really understand if this isn't a priority for you.
> >>
> >> It's a well designed product that has had some real thought put into it. It's a really promising system; thanks for taking the time to respond to my comments.
> >>
> >> Paul
> >>
> >> On Tue, Jul 19, 2011 at 10:44 AM, Jay Kreps <[email protected]> wrote:
> >>
> >>> Ah, I think what you are describing in zeromq is essentially the equivalent of group commit for the socket: you wait until the socket is no longer writable and then begin to queue data. This is an interesting idea. Of course it would only have a positive effect when you had already overflowed the socket buffer and were sending a very high throughput of small messages. That is basically a way to degrade an overloaded synchronous send into a batched send.
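For illustration, a minimal sketch of that socket-level group commit, with hypothetical names and not taken from zeromq or Kafka: producer threads enqueue and return immediately, and a sender thread writes whatever accumulated while the previous write was in flight as a single batch.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of socket-level "group commit": the sender never waits on a timer;
    // whatever piled up while the previous write was in flight goes out as one batch.
    public class GroupCommitSender implements Runnable {
        private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();

        public void send(byte[] message) throws InterruptedException {
            queue.put(message);              // producer threads enqueue and return immediately
        }

        @Override
        public void run() {
            List<byte[]> batch = new ArrayList<>();
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    batch.add(queue.take()); // block only when there is nothing to send
                    queue.drainTo(batch);    // grab everything that accumulated meanwhile
                    writeBatch(batch);       // one write for the whole batch
                    batch.clear();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        private void writeBatch(List<byte[]> batch) {
            // stand-in for the actual socket write; under load batch.size() grows,
            // so the number of writes per message shrinks.
        }
    }

Under light load this degenerates to one write per message; under heavy load the batch grows on its own, which is the "degrade an overloaded synchronous send into a batched send" behavior described above.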
> >>> This is not really the same as what we have done, which is to allow the ability to trade off latency for throughput in a configurable manner. The reason the latter is important is that we do not have a handful of producers sending at a rate that saturates the network I/O capacity of those servers (the case where the group commit would help); rather, we have thousands of producers sending at a medium-low volume, so we would never hit that in our use case. The advantage of batching is fewer requests hitting the server, and larger packets. Where the group commit would help is for the synchronous producer benchmarks, where you could potentially get much better throughput. This is something we should consider adding.
> >>>
> >>> To be clear, though, we have not added latency in our layer, just made a configurable way to trade off latency for throughput. This is unambiguously a good thing, I think.
> >>>
> >>> With respect to mmap, I think you are misunderstanding where the latency comes from. We immediately write data to the filesystem with no delay whatsoever. This incurs the overhead of a system call, as you point out, which could be avoided by mmap, but that doesn't add much in the way of latency. The latency comes from the fact that we do not make the written data available to consumers until we fsync the file to "ensure" the durability of consumed messages. The frequency of the fsync is configurable: either immediate, or with a time or message-count threshold. This again trades latency for throughput.
> >>>
> >>> -Jay
> >>>
> >>> On Mon, Jul 18, 2011 at 10:26 PM, Paul Sutter <[email protected]> wrote:
> >>>
> >>>> *Producer latency* - I'm not familiar with zeromq internals, but my understanding is that they send the first message(s) immediately and, as TCP queues up the data, it will eventually block as the send buffer fills; during this time messages can queue up, and the net-net is that on average the number of system calls is << the number of messages. The key is having a separate thread for network operations with very efficient thread coordination. Clearly Nagle is disabled (TCP_NODELAY), as Nagle is a blight against humanity.
> >>>>
> >>>> Having any sort of delay adds latency. If every developer thinks it's OK to add a little latency in his layer, pretty soon you end up with 10-second end-to-end latency.
> >>>>
> >>>> Having an "accumulated message count" is also bad for WAN performance. If your "window size" is a set of delayed messages, the only way to deal with a large bandwidth*delay product is to delay a lot of messages, then send them. You can fit a lot of data into a fiber: on a gigabit link with a 100 ms round-trip time, roughly 100 megabits (about 12 MB) is in flight at any moment, and you need multiples of that for buffering if you need to do a retransmit.
> >>>>
> >>>> *Broker latency* - With mmap, the memcpy() of the message should make the data available to a thread even in another process; the pages that you have mapped are also in the buffer cache and available to a sendfile() call, or at least I think so. The flush to physical disk (msync() in this case) would still be delayed, but without impacting end-to-end latency.
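A small sketch of the visibility point being made here, assuming Java NIO on a POSIX-style system rather than raw mmap()/memcpy(); the file name and sizes are illustrative, not the broker's actual code. Bytes put into a shared mapping are served from the page cache to other readers before any force()/msync pushes them to disk.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Bytes copied into a mapped region are readable through the page cache
    // (here via a second FileChannel) before any force()/msync pushes them to disk.
    public class MmapVisibility {
        public static void main(String[] args) throws IOException {
            Path log = Path.of("segment.log");
            byte[] msg = "hello".getBytes(StandardCharsets.UTF_8);

            try (FileChannel writer = FileChannel.open(log, StandardOpenOption.CREATE,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer map = writer.map(FileChannel.MapMode.READ_WRITE, 0, msg.length);
                map.put(msg);                  // the "memcpy": no write() system call per message

                try (FileChannel reader = FileChannel.open(log, StandardOpenOption.READ)) {
                    ByteBuffer out = ByteBuffer.allocate(msg.length);
                    reader.read(out, 0);       // sees the bytes even though map.force() was never called
                    System.out.println(new String(out.array(), StandardCharsets.UTF_8));
                }
                // map.force() (msync) could be deferred without delaying the read above.
            }
        }
    }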
> >>>> That said, in benchmarks I have done, the fastest IO with the lowest CPU overhead is unbuffered (direct) IO (which is lower overhead than using the buffer cache, with or without memory mapping), but then you'd have to manage your own buffer pool and run your broker in a single multithreaded process. But that's getting more extreme. Just getting rid of this buffer write delay by using memory mapping will remove a big chunk of latency.
> >>>>
> >>>> On Mon, Jul 18, 2011 at 9:30 PM, Jay Kreps <[email protected]> wrote:
> >>>>
> >>>>> Hi Paul,
> >>>>>
> >>>>> We are definitely interested in lowering latency (lower is always better), but that has not been a major concern for us so far (we were replacing a system with 1-hour latency), so we haven't focused on it yet. As you describe, latency in our setup at LinkedIn comes from batching on the frontend and batching on the Kafka servers due to very lenient flush settings.
> >>>>>
> >>>>> I am interested in your comments on zeromq. Do they actually have a better approach for this problem even when using TCP? If so, I would be interested to understand it. The way I see things, this is about trading throughput and latency. On the producer side you have only a few options: immediately write the data to the socket buffer for sending, or wait and see if the application writes more data. The OS will do this for you unless you set TCP_NODELAY, but the OS is relatively inflexible; it doesn't understand your data, so I think it just waits 200ms or until the socket buffer is full.
> >>>>>
> >>>>> The current approach in the async producer captures the same tradeoff, but a little more flexibly: it allows you to specify a max delay and a max accumulated message count, and data is written when either of those is hit.
> >>>>>
> >>>>> Is it possible to better capture this tradeoff? Basically I am not aware of any other trick here if you are using TCP, so I would be interested in what zeromq does if they are doing this better.
> >>>>>
> >>>>> We do indeed write each message set to the filesystem as it arrives, but we distribute messages to consumers only after the write has been flushed to disk; delaying (batching) that flush is the cause of the latency, but it also gives better use of IOPS by generating larger writes. Mmap would remove the system call (which would be good), but not the flush, I think. As you say, adding replication allows giving stronger guarantees without actually caring about durability on a particular server, which would make it possible to distribute messages to consumers after an ack from some number of other servers, irrespective of flushing to disk.
> >>>>>
> >>>>> -Jay
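A minimal sketch of the time-or-count trigger described above for the async producer: flush when either the maximum delay elapses or the maximum accumulated message count is reached. Class and field names here are illustrative assumptions, not the actual Kafka implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Flush a batch when either maxDelayMs has elapsed since the first queued
    // message or maxBatchSize messages have accumulated, whichever comes first.
    public class TimeOrCountBatcher implements Runnable {
        private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
        private final long maxDelayMs;   // analogous to the producer's max delay
        private final int maxBatchSize;  // analogous to the max accumulated message count

        public TimeOrCountBatcher(long maxDelayMs, int maxBatchSize) {
            this.maxDelayMs = maxDelayMs;
            this.maxBatchSize = maxBatchSize;
        }

        public void send(byte[] message) throws InterruptedException {
            queue.put(message);
        }

        @Override
        public void run() {
            List<byte[]> batch = new ArrayList<>();
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    batch.add(queue.take());                       // wait for the first message
                    long deadline = System.currentTimeMillis() + maxDelayMs;
                    while (batch.size() < maxBatchSize) {
                        long wait = deadline - System.currentTimeMillis();
                        if (wait <= 0) break;                      // max delay reached
                        byte[] m = queue.poll(wait, TimeUnit.MILLISECONDS);
                        if (m == null) break;                      // timed out waiting for more
                        batch.add(m);
                    }
                    writeBatch(batch);                             // count or delay threshold hit
                    batch.clear();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        private void writeBatch(List<byte[]> batch) {
            // stand-in for the network write
        }
    }

Setting maxDelayMs to 0 and maxBatchSize to 1 recovers one write per message, the zero-delay case Paul asks about in question 4 further down the thread.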
> >>>>> On Mon, Jul 18, 2011 at 8:27 PM, Paul Sutter <[email protected]> wrote:
> >>>>>
> >>>>>> Jun,
> >>>>>>
> >>>>>> Thanks for your answers and the link to the paper - that helps a lot, especially the comment in the paper that 10-second end-to-end latency is good enough for your intended use case.
> >>>>>>
> >>>>>> We're looking for much lower latencies, and the basic design of Kafka feels like it should support latencies in milliseconds with a few changes. We're either going to build our own system or help develop something that already exists, so please take my comments in the constructive way they're intended. (I realize the changes I'm suggesting are outside your intended use case, but if you're interested we may be able to provide a very capable developer to help with the work, assuming we choose Kafka over the other zillion streaming systems that are coming out of the woodwork.)
> >>>>>>
> >>>>>> a. *Producer "queue.time"* - In my question 4 below, I was referring to the producer queue time. With a default value of 5 seconds, that accounts for half your end-to-end latency. A system like zeromq is optimized to write data immediately without delay, but in such a way as to minimize the number of system calls required at high message rates. Zeromq is no nirvana, but it has a number of nice properties.
> >>>>>>
> >>>>>> b. *Broker "log.default.flush.interval.ms"* - The default value of 3 seconds appears to be another significant source of latency in the system, assuming that clients are unable to access data until it has been flushed. Since you have wisely chosen to take advantage of the buffer cache as part of your system design, it seems that you could remove this latency completely by memory-mapping the partitions and memcpying each message as it arrives. With the right IPC mechanism, clients could have immediate access to new messages.
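For reference, the two settings named in (a) and (b) above could be lowered along the lines of the sketch below. The keys queue.time and log.default.flush.interval.ms come from this thread; the other property name, the units, and the values are assumptions that depend on the Kafka version.

    import java.util.Properties;

    // Hypothetical low-latency settings for the two knobs discussed in (a) and (b).
    public class LowLatencySettings {
        public static void main(String[] args) {
            Properties producerProps = new Properties();
            producerProps.setProperty("producer.type", "async");  // assumed switch for the async producer
            producerProps.setProperty("queue.time", "50");        // (a) max ms a message may sit in the producer queue (default cited: 5 s)

            Properties brokerProps = new Properties();
            brokerProps.setProperty("log.default.flush.interval.ms", "50"); // (b) max ms before a flush makes data visible to consumers (default cited: 3 s)

            System.out.println("producer: " + producerProps);
            System.out.println("broker:   " + brokerProps);
        }
    }

Lower values trade the larger, less frequent writes Jay describes above for lower end-to-end latency.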
> >>>>>> c. *Batching, sync vs async, replication, and auditing* - It's understandable that you've chosen a forensic approach to producer reliability (after-the-fact auditing), but when you implement replication it would be really nice to revise the producer protocol mechanisms. If you used a streaming mechanism with producer offsets and ACKs, you could ensure reliable delivery of producer streams to multiple brokers without the need to choose a "batch size" or "queue.time". This could also give you active/active failover of brokers. It may also help in the WAN case (my question 3 below), because you will be able to adaptively stuff more and more data through the fiber on high bandwidth*delay links without having to choose a large "batch size" or incur the additional latency that entails. Oh, and it will help you deal with CRC errors once you start checking for them.
> >>>>>>
> >>>>>> d. *Performance measurements* - I'd like to make a suggestion for your performance measurements. Your benchmarks measure throughput, but a throughput number is meaningless without an associated "% CPU time". Ideally all measurements would achieve wire speed (100MB/sec) at 0% CPU (since, after all, this is plumbing, and we assume the cores in the system should have capacity set aside for useful work too). Obviously nobody ever achieves this, but by measuring it one can raise the bar in terms of optimization.
> >>>>>>
> >>>>>> Paul
> >>>>>>
> >>>>>> ps. Just for background, I am the cofounder at Quantcast, where we process 3.5PB of data per day. These questions are related to my new startup Quantbench, which will deal with financial market data where you don't want any latency at all. And WAN issues are a big deal too. Incidentally, I was also founder of Orbital Data, which was a WAN optimization company, so I've done a lot of work with protocols over long distances.
> >>>>>>
> >>>>>> On Mon, Jul 18, 2011 at 7:14 PM, Jun Rao <[email protected]> wrote:
> >>>>>>
> >>>>>>> Paul,
> >>>>>>>
> >>>>>>> Excellent questions. See my answers below. Thanks,
> >>>>>>>
> >>>>>>> On Mon, Jul 18, 2011 at 6:41 PM, Paul Sutter <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Kafka looks like an exciting project, thanks for opening it up.
> >>>>>>>>
> >>>>>>>> I have a few questions:
> >>>>>>>>
> >>>>>>>> 1. Are checksums end to end (i.e., created by the producer and checked by the consumer)? Or are they only used to confirm buffer-cache behavior on disk, as mentioned in the documentation? Bit errors occur vastly more often than most people assume, often because of device driver bugs. TCP only detects 1 error in 65536, so errors can flow through (if you like, I can send links to papers describing the need for checksums everywhere).
> >>>>>>>>
> >>>>>>> The checksum is generated at the producer and propagated to the broker and eventually the consumer. Currently, we only validate the checksum at the broker. We could further validate it at the consumer in the future.
> >>>>>>>
> >>>>>>>> 2. The consumer has a pretty solid mechanism to ensure it hasn't missed any messages (I like the design, by the way), but how does the producer know that all of its messages have been stored? (There is no apparent message id on that side, since the message id isn't known until the message is written to the file.) I'm especially curious how failover/replication could be implemented, and I'm thinking that acks on the publisher side may help.
> >>>>>>>
> >>>>>>> The producer-side auditing is not built in. At LinkedIn, we do it by generating an auditing event periodically in the event handler of the async producer. The auditing event contains the number of events produced in a configured window (e.g., 10 minutes) and is sent to a separate topic. The consumer can read the actual data and the auditing event and compare the counts. See our paper ( http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf ) for some more details.
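A rough sketch of the producer-side auditing described just above, with hypothetical names: count events per fixed window and periodically publish each closed window's count to a separate audit topic so a consumer can reconcile totals against what it received.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Counts produced events per fixed time window; counts for closed windows are
    // handed to a publisher (a stand-in for a send to a separate audit topic).
    public class ProducerAuditCounter {
        private final long windowMs;
        private final Map<Long, AtomicLong> countsByWindow = new ConcurrentHashMap<>();

        public ProducerAuditCounter(long windowMs) {
            this.windowMs = windowMs;        // e.g. 10 minutes = 600_000 ms
        }

        // Call once per event actually handed to the producer.
        public void recordEvent(long timestampMs) {
            long windowStart = (timestampMs / windowMs) * windowMs;
            countsByWindow.computeIfAbsent(windowStart, w -> new AtomicLong()).incrementAndGet();
        }

        // Called periodically (e.g. from the async producer's event handler).
        public void publishClosedWindows(long nowMs, AuditPublisher publisher) {
            long currentWindow = (nowMs / windowMs) * windowMs;
            countsByWindow.forEach((windowStart, count) -> {
                if (windowStart < currentWindow) {               // window is closed
                    publisher.publish(windowStart, count.get()); // emit the audit event
                    countsByWindow.remove(windowStart);
                }
            });
        }

        public interface AuditPublisher {
            void publish(long windowStartMs, long eventCount);
        }
    }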
> >>>>>>>
> >>>>>>>> 3. Has the consumer's flow control been tested over high bandwidth*delay links? (What bandwidth can you get from a London consumer of an SF cluster?)
> >>>>>>>>
> >>>>>>> Yes, we actually replicate Kafka data across data centers, using an embedded consumer in a broker. Again, there is a bit more info on this in our paper.
> >>>>>>>
> >>>>>>>> 4. What kind of performance do you get if you set the producer's message delay to zero? (i.e., is there a separate system call for each message, or do you manage to aggregate messages into a smaller number of system calls even with a delay of 0?)
> >>>>>>>
> >>>>>>> I assume that you are referring to the flush interval. One can configure it to flush every message to disk. This will slow down throughput significantly.
> >>>>>>>
> >>>>>>>> 5. Have you considered using a library like zeromq for the messaging layer instead of rolling your own? (zeromq will handle #4 cleanly at millions of messages per second and has clients in 20 languages)
> >>>>>>>
> >>>>>>> No. Our proprietary format allows us to support things like compression in the future. However, we can definitely look into the zeromq format. Is their messaging layer easily extractable?
> >>>>>>>
> >>>>>>>> 6. Do you have any plans to support intermediate processing elements the way Flume does?
> >>>>>>>
> >>>>>>> For now, we are just focusing on getting the raw messaging layer solid. We have worked a bit on stream processing and will look into that again in the future.
> >>>>>>>
> >>>>>>>> 7. The docs mention that new versions will only be released after they are in production at LinkedIn. Does that mean that the latest version of the source code is hidden at LinkedIn, and contributors would have to throw patches over the wall and wait months to get the integrated product?
> >>>>>>>
> >>>>>>> What we ran at LinkedIn is the same version that is in open source, and there is no internal repository of Kafka at LinkedIn. We plan to maintain that in the future.
> >>>>>>>
> >>>>>>>> Thanks!
