Hi all,

First, thanks to Tim (from Rabbit) and Jonathan for moving this thread
along.  Jonathan, I hope you found my links to the data model docs,
and Tim's replies, helpful.

Has everyone got what they wanted from this thread?

alexis


On Tue, Jun 11, 2013 at 5:49 PM, Jonathan Hodges <hodg...@gmail.com> wrote:
> Hi Tim,
>
> While your comments regarding durability are accurate for the 0.7 version
> of Kafka, the picture is a bit greyer with 0.8.  In 0.8 you have the
> ability to configure Kafka for the durability you need.  This is what I
> was referring to with the link to Jun’s ApacheCon slides
> (http://www.slideshare.net/junrao/kafka-replication-apachecon2013).
>
> If you look at slide 21, titled ‘Data Flow in Replication’, you see the
> three possible durability configurations, which trade off latency for
> greater persistence guarantees.
>
> The third row is the ‘no data loss’ configuration option where the producer
> only receives an ack from the broker once the message(s) are committed by
> the leader and peers (mirrors, as you call them) and flushed to disk.  This
> seems to be very similar to the scenario you describe in Rabbit, no?
>
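> To make that concrete, here is a minimal sketch of what that third option
> looks like from the producer side with the 0.8 Java API, as I understand
> it (broker address and topic name are made up for illustration):
>
>     import java.util.Properties;
>     import kafka.javaapi.producer.Producer;
>     import kafka.producer.KeyedMessage;
>     import kafka.producer.ProducerConfig;
>
>     public class NoDataLossProducer {
>         public static void main(String[] args) {
>             Properties props = new Properties();
>             props.put("metadata.broker.list", "broker1:9092");
>             props.put("serializer.class", "kafka.serializer.StringEncoder");
>             // -1 = no ack until the leader and all in-sync replicas
>             // have the message
>             props.put("request.required.acks", "-1");
>             // sync mode: send() blocks until that ack arrives
>             props.put("producer.type", "sync");
>
>             Producer<String, String> producer =
>                 new Producer<String, String>(new ProducerConfig(props));
>             producer.send(new KeyedMessage<String, String>("events", "hello"));
>             producer.close();
>         }
>     }
>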
> Jun or Neha, can you please confirm that my understanding of 0.8 durability
> is correct and that there is no data loss in the scenario I describe?  I
> know there
> is a separate configuration setting, log.flush.interval.messages, but I
> thought in sync mode the producer doesn’t receive an ack until message(s)
> are committed and flushed to disk.  Please correct me if my understanding
> is incorrect.
>
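> For reference, the broker-side flush settings I mean live in
> server.properties; a sketch with illustrative values (this is my reading
> of the 0.8 docs, so please correct me):
>
>     # fsync after every message - maximum durability, lowest throughput
>     log.flush.interval.messages=1
>     # upper bound on how long a message can sit unflushed, in ms
>     log.flush.interval.ms=1000
>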
> Thanks!
>
>
On Tue, Jun 11, 2013 at 8:20 AM, Tim Watson <watson.timo...@gmail.com> wrote:
>
>> Hi Jonathan,
>>
>> So, thanks for replying - that's all useful info.
>>
>> On 10 Jun 2013, at 14:19, Jonathan Hodges wrote:
>> > Kafka has a configurable rolling window of time for which it keeps the
>> > messages per topic.  The default is 7 days, and after this time the
>> > messages are removed from disk by the broker.
>> > Correct, the consumers maintain their own state via what are known as
>> > offsets.  Also true that when producers/consumers contact the broker
>> > there is a random seek to the starting offset, but the majority of
>> > access patterns are linear.
>> >
>>
>> So, just to be clear, the distinction that has been raised on this thread
>> is only part of the story, viz. the difference in rates between RabbitMQ and
>> Kafka. Essentially, these two systems are performing completely different
>> tasks, since in RabbitMQ, the concept of a long-term persistent topic whose
>> entries are removed solely based on expiration policy is somewhat alien.
>> RabbitMQ will delete messages from its message store as soon as a relevant
>> consumer has seen and ACK'ed them, which *requires* tracking consumer state
>> in the broker. I suspect this was your (earlier) point about Kafka /not/
>> trying to be a general purpose message broker, but having an architecture
>> that is highly tuned to a specific set of usage patterns.
>>
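>> As I understand it, the Kafka side of that long-term topic is just broker
>> retention configuration, something along these lines (setting name from
>> the Kafka docs, value illustrative):
>>
>>     # keep messages for 7 days, whether or not anyone has consumed them
>>     log.retention.hours=168
>>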
>> >> As you can see in the last graph of 10 million messages, which is less
>> >> than a GB on disk, the Rabbit throughput is capped around 10k/sec.
>> >> Beyond throughput, with the pending release of 0.8, Kafka will also
>> >> have advantages around message guarantees and durability.
>> >>
>> >
>> [snip]
>> > Correct, with 0.8 Kafka will have options similar to Rabbit's fsync
>> > configuration option.
>>
>> Right, but just to be clear, unless Kafka starts to fsync for every single
>> published message, you are /not/ going to offer the same guarantee. In this
>> respect, rabbit is clearly putting safety above performance when that's
>> what users ask it for, which is fine for some cases and not for others. By
>> way of example, if you're using producer/publisher confirms with RabbitMQ,
>> the broker will not ACK receipt of a message until (a) it has been fsync'ed
>> to disk and (b) if the queue is mirrored, each mirror has acknowledged
>> receipt of the message. Again, unless you're fsync-ing to disk on each
>> publish, the guarantees will be different - and rightly so, since you can
>> deal with re-publishing and de-duplication quite happily in a system that's
>> dealing with a 7-day sliding window of data, and thus ensuring throughput is
>> more useful (in that case) than avoiding data loss on the server.
>>
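>> For reference, the confirms flow I'm describing looks roughly like this
>> with the RabbitMQ Java client (connection details and queue name made up
>> for illustration):
>>
>>     import com.rabbitmq.client.Channel;
>>     import com.rabbitmq.client.Connection;
>>     import com.rabbitmq.client.ConnectionFactory;
>>     import com.rabbitmq.client.MessageProperties;
>>
>>     public class ConfirmedPublish {
>>         public static void main(String[] args) throws Exception {
>>             Connection conn = new ConnectionFactory().newConnection();
>>             Channel channel = conn.createChannel();
>>             channel.confirmSelect();  // put the channel into confirm mode
>>             // durable queue, persistent message
>>             channel.queueDeclare("events", true, false, false, null);
>>             channel.basicPublish("", "events",
>>                     MessageProperties.PERSISTENT_BASIC, "hello".getBytes());
>>             // blocks until the broker ACKs: message fsync'ed to disk and,
>>             // for a mirrored queue, accepted by each mirror
>>             channel.waitForConfirms();
>>             conn.close();
>>         }
>>     }
>>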
>> Of course, architecturally, fsync-ing very regularly will kill the
>> benefits that mmap combined with sendfile give you, since relying on the
>> kernel's paging / caching capabilities is the whole point of doing that.
>> That's not intended to be a criticism btw, just an observation about the
>> distinction between the two systems' differing approaches.
>>
>> > Messages have always had ordering guarantees, but with 0.8 there is the
>> > notion of topic replicas, similar to replication factor in Hadoop or
>> > Cassandra.
>> >
>> > http://www.slideshare.net/junrao/kafka-replication-apachecon2013
>> >
>> > With configuration you can trade off latency for durability with 3
>> > options:
>> >  - Producer receives no acks (no network delay)
>> >  - Producer waits for ack from broker leader (1 network roundtrip)
>> >  - Producer waits for quorum ack (2 network roundtrips)
>> >
>>
>> Sounds very interesting, I'll take a look.
>>
>> > With the combination of quorum commits and consumers managing state you
>> > can get much closer to exactly-once guarantees, i.e. the consumers can
>> > manage their consumption state as well as the consumed messages in the
>> > same transaction.
>> >
>>
>> Hmn. This idea (of exactly-once delivery) has long been debated in the
>> rabbit community. For example,
>> http://rabbitmq.1065348.n5.nabble.com/Exactly-Once-Delivery-td16826.html
>> covers a number of objections to doing this, though again, since Kafka is
>> addressing a different problem space, perhaps the constraints differ
>> somewhat.
>>
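>> That said, the pattern Jonathan describes - storing the consumed data and
>> the consumption offset in one local transaction - would presumably look
>> something like this sketch (plain JDBC; the java.sql.Connection conn and
>> the message's payload/offset variables, plus all table and column names,
>> are hypothetical):
>>
>>     // within a single transaction: apply the message and record its
>>     // offset, so a crash never leaves one without the other
>>     conn.setAutoCommit(false);
>>     try {
>>         PreparedStatement insert = conn.prepareStatement(
>>                 "INSERT INTO events (payload) VALUES (?)");
>>         insert.setString(1, messagePayload);
>>         insert.executeUpdate();
>>
>>         PreparedStatement offsets = conn.prepareStatement(
>>                 "UPDATE offsets SET next_offset = ? WHERE topic = ?");
>>         offsets.setLong(1, messageOffset + 1);
>>         offsets.setString(2, "events");
>>         offsets.executeUpdate();
>>
>>         conn.commit();  // both writes succeed or neither does
>>     } catch (SQLException e) {
>>         conn.rollback();  // on restart, resume from the stored offset
>>         throw e;
>>     }
>>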
>> Cheers,
>> Tim
>>
>>
>>
