Hi Arya,

In the case of a kernel panic or other failure serious enough to bring
a machine down (hardware failure, power outage, etc.), message loss is
definitely possible, and not totally preventable regardless of what
precautions software takes to avoid data loss.  To ensure high
throughput, Dory batches messages together before forwarding them to
Kafka.  The amount of batching Dory performs is configurable on a
per-topic basis, and in practice will typically cause a delay of
between several hundred milliseconds and several seconds before
messages are forwarded to Kafka.  Thus, if a machine on which Dory is
running crashes, you can expect to lose up to a few seconds of batched
messages depending on the batching configuration.  Additionally, you
may lose some sent messages for which Dory has not yet received an ACK
from Kafka indicating successful receipt and persistence to disk.

To narrow the window of possible message loss, one can set a very
short batching delay, or even completely disable batching for one or
more topics.  However, this is not recommended, since it will have a
serious adverse effect on the performance of your Kafka cluster.
Another possibility would be for Dory to immediately persist messages
to local disk on receipt, even before they are batched, and then
delete them only after a non-error ACK has been received from Kafka.
This would greatly narrow, although not totally eliminate, the window
for message loss.  However, it would add substantial complexity to
Dory and impose a substantial disk I/O performance cost.  An
important consideration is that when you write to a file, the kernel's
default action is to not immediately write the data to disk.  It
buffers the data in its page cache for a period of time (typically up
to 30 seconds) to improve throughput.  If the machine crashes, all of
this unwritten data is lost.  You can get around this by calling
fsync() on a file descriptor or specifying O_DIRECT | O_SYNC when
opening a file, but this will degrade I/O performance.  Dory avoids
persisting data to disk with the view that these downsides outweigh
the benefits of avoiding up to a few seconds of data loss when a
machine crashes.  However, a feature I intend to add (described later
in this email) will provide useful information about which messages
were successfully delivered to Kafka in the event of a machine crash.
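
To make the fsync() cost concrete, here is a minimal sketch of what
"persist on receipt" would involve.  It is an illustration only, not
code from Dory (which deliberately does not do this).  The function
name append_durably and the file name spool.log are made up for the
example.

```python
import os
import tempfile


def append_durably(path: str, payload: bytes) -> None:
    """Append payload to path, returning only after fsync() succeeds.

    Hypothetical helper for illustration; Dory itself does not
    persist messages to disk.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(payload)
        while view:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        # Without this call the data may sit in the page cache and be
        # lost if the machine crashes before the kernel writes it out.
        os.fsync(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    spool = os.path.join(tmp, "spool.log")
    append_durably(spool, b"message 1\n")
    append_durably(spool, b"message 2\n")
    with open(spool, "rb") as f:
        contents = f.read()
```

Each call blocks until the storage device acknowledges the write, so
doing this once per message is where the I/O cost mentioned above
comes from.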

Dory takes great care to avoid message loss due to circumstances
other than machine crashes.  When Dory starts up, it preallocates a
fixed amount of memory (configurable with the --msg_buffer_max N
command line option) for storing messages received from clients.
Dory holds on to each message it receives until one of the following
has occurred:

    - It receives an ACK from Kafka indicating that the message has
      been successfully received and persisted to disk.

    - It receives an error ACK from Kafka that causes the message to
      be immediately discarded.  Certain error ACKs cause Dory to
      immediately discard messages, and others cause it to pause,
      update its metadata, and attempt redelivery based on the new
      metadata.  ACK error 2 (invalid message, i.e. CRC error) causes
      Dory to immediately attempt redelivery without updating its
      metadata.  The actions Dory takes for various error ACK values
      are documented here:

          https://github.com/dspeterson/dory/blob/master/doc/design.md#dispatcher

    - As mentioned above, certain error ACKs will cause Dory to
      attempt redelivery.  Each time this occurs for a given message,
      Dory increments a failed delivery attempt count on the message.
      Once the count exceeds a certain value (specified by the
      --max_failed_delivery_attempts N command line option), Dory
      discards the message.

    - It has determined that the message is undeliverable due to a
      problem such as the message being malformed (see
      https://github.com/dspeterson/dory/blob/master/doc/sending_messages.md
      which documents Dory's input message format) or having an
      invalid topic.  In this case, the message is discarded.

    - It has exhausted its preallocated memory for storing messages.
      This may occur in the case where a network-related outage or
      serious problem with the Kafka cluster persists for an extended
      period of time, preventing normal message delivery.  In this
      case, new messages received from clients are discarded.

    - Dory receives a shutdown signal (SIGTERM or SIGINT).  In this
      case, Dory immediately stops accepting new messages from
      clients, and for a certain configurable period of time attempts
      to deliver any pending messages.  Once the time limit expires,
      any remaining messages are lost.  This type of message loss
      can be avoided by not shutting Dory down until no more clients
      are sending it messages, and Dory has been given enough time to
      process all pending messages.
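
The ACK-related cases above can be summarized as a small decision
function.  This is only a rough sketch and not Dory's actual code:
the names AckKind, Action, PendingMessage and on_ack are invented for
the example, the grouping of Kafka error codes into kinds is left
abstract (see the design document linked above for the real mapping),
and I am assuming here that an immediate redelivery after a CRC error
also counts as a failed delivery attempt.

```python
from dataclasses import dataclass
from enum import Enum, auto


class AckKind(Enum):
    # Hypothetical grouping of ACK results, for illustration only.
    OK = auto()                   # received and persisted by Kafka
    DISCARD = auto()              # error ACK: discard immediately
    RETRY_AFTER_REFRESH = auto()  # error ACK: pause, update metadata
    RETRY_NOW = auto()            # e.g. error 2 (CRC): resend as-is


class Action(Enum):
    RELEASE = auto()
    DISCARD = auto()
    REFRESH_METADATA_AND_RESEND = auto()
    RESEND = auto()


@dataclass
class PendingMessage:
    topic: str
    payload: bytes
    failed_attempts: int = 0


def on_ack(msg: PendingMessage, ack: AckKind,
           max_failed_delivery_attempts: int) -> Action:
    """Decide what to do with a message once its ACK arrives."""
    if ack is AckKind.OK:
        return Action.RELEASE
    if ack is AckKind.DISCARD:
        return Action.DISCARD
    # Every redelivery-triggering error bumps the per-message count.
    msg.failed_attempts += 1
    if msg.failed_attempts > max_failed_delivery_attempts:
        return Action.DISCARD
    if ack is AckKind.RETRY_NOW:
        return Action.RESEND
    return Action.REFRESH_METADATA_AND_RESEND


msg = PendingMessage(topic="example_topic", payload=b"hello")
actions = [on_ack(msg, AckKind.RETRY_AFTER_REFRESH, 2) for _ in range(3)]
```

With a limit of 2, the first two failures lead to redelivery and the
third causes the message to be discarded, since the count then
exceeds the limit.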

A critical aspect of Dory's behavior is that every discard is tracked
along with the reason why the discard occurred.  Dory provides a web
interface from which periodic discard reports and other status
information can be obtained.  Thus, in the rare event that discards
occur, they can be tracked and recorded on a per-topic basis.  The
intent is for monitoring infrastructure to periodically ask Dory for
a discard report, alert an administrator if message loss occurred, and
optionally store the discard reports in a database so that a queryable
history of data quality is available.

A relevant feature I intend to add at some point is described here:

    http://dory.wikidot.com/wiki:commit-points

This notion of a commit point can be useful in the event of a machine
crash.  If monitoring infrastructure periodically queries Dory for
commit point information, then in the event of a machine crash, we
know that any resulting message loss is limited to messages with
timestamps > T, where T is the latest commit point.
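
As a rough sketch of how monitoring code might use this (the feature
does not exist yet, so everything here is hypothetical: the function
name, and the choice to represent commit points and message
timestamps as integer milliseconds):

```python
from typing import Iterable, List


def possibly_lost(message_timestamps: Iterable[int],
                  polled_commit_points: Iterable[int]) -> List[int]:
    """Return the timestamps of messages that may have been lost.

    T is the latest commit point seen before the crash.  Messages
    with timestamps <= T are known to have been delivered; anything
    later is a candidate for loss.
    """
    latest_commit_point = max(polled_commit_points)
    return [ts for ts in message_timestamps if ts > latest_commit_point]


# Commit points collected by periodic polling before the crash.
commit_points = [1000, 1200, 1400]
# Timestamps of messages the client sent.
sent = [900, 1100, 1400, 1500, 1700]
candidates = possibly_lost(sent, commit_points)
```

Here T is 1400, so only the messages stamped 1500 and 1700 need to be
treated as possibly lost.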

I hope this helps answer your questions, and that I haven't inundated
you with too much information.  Let me know if you have more
questions.


Regards,
Dave


-----Original Message-----
From: "Arya Ketan" <ketan.a...@gmail.com>
Sent: Sunday, June 12, 2016 11:52am
To: users@kafka.apache.org
Subject: Re: Introducing Dory

Hi Dave,
Dory looks pretty interesting. I had a few  further questions on it
a) How does Dory handle kernel panics?
b) What kind of message guarantees does dory provide and also if you can
share some design decisions taken to enable the guarantees whatever they
are.

Thanks
Arya

On Sun, Jun 12, 2016 at 8:10 PM, Dave Peterson <d...@dspeterson.com> wrote:

> Thanks!  Enjoy :-)
>
>
>
> On 6/12/2016 12:24 AM, Gwen Shapira wrote:
>
>> Dory is pretty cool (even though it is named after a somewhat dorky
>> fish). Thank you for sharing :)
>>
>> On Sun, Jun 12, 2016 at 1:24 AM, Dave Peterson <d...@dspeterson.com>
>> wrote:
>>
>>> Hello Kafka users,
>>>
>>> Version 1.1.0 of Dory is now available.  See
>>> https://github.com/dspeterson/dory for details.  Dory is the successor
>>> to Bruce (https://github.com/tagged/bruce), a Kafka producer daemon I
>>> created while working at if(we) (http://www.ifwe.co/).  The code has
>>> seen a number of improvements since its initial release in September
>>> 2014.  The list of example clients for various programming languages
>>> has also been extended.  Dory maintains full backward compatibility
>>> with Bruce, so existing users can easily switch.
>>>
>>> The latest release adds support for receiving messages from clients by
>>> UNIX domain stream socket or local TCP.  Although UNIX domain
>>> datagrams are still the preferred means of sending messages in most
>>> cases, the option of using stream sockets facilitates sending messages
>>> too large to fit in a single datagram.  The local TCP option
>>> facilitates adding support for clients written in programming
>>> languages that do not provide easy access to UNIX domain sockets.
>>>
>>> Dory's wiki page http://dory.wikidot.com/start contains a list of
>>> ideas for additional features and other improvements.  Community
>>> feedback is welcomed and appreciated.  If you have ideas for things
>>> you would like to see in future releases, please add them to the list.
>>> Also, please contribute code if you can afford the time.
>>>
>>>
>>> Thanks,
>>> Dave Peterson
>>>
>>>
>>>
>

