Hi Arya, In the case of a kernel panic or other failure serious enough to bring a machine down (hardware failure, power outage, etc.), message loss is definitely possible, and not totally preventable regardless of what precautions software takes to avoid data loss. To ensure high throughput, Dory batches messages together before forwarding them to Kafka. The amount of batching Dory performs is configurable on a per-topic basis, and in practice will typically cause a delay of between several hundred milliseconds and several seconds before messages are forwarded to Kafka. Thus, if a machine on which Dory is running crashes, you can expect to lose up to a few seconds of batched messages depending on the batching configuration. Additionally, you may lose some sent messages for which Dory has not yet received an ACK from Kafka indicating successful receipt and persistence to disk.
To narrow the window of possible message loss, one can set a very short batching delay, or even completely disable batching for one or more topics. However this is not recommended since it will have a serious adverse affect on the performance of your Kafka cluster. Another possibility would be for Dory to immediately persist messages to local disk on receipt, even before they are batched, and then delete them only after a non-error ACK has been received from Kafka. This would greatly narrow, although not totally eliminate the window for message loss. However it would add substantial complexity to Dory, and impose a substantial disk I/O performance cost. An important consideration is that when you write to a file, the kernel's default action is to not immediately write the data to disk. It buffers the data in its page cache for a period of time (typically up to 30 seconds) to improve throughput. If the machine crashes, all of this unwritten data is lost. You can get around this by calling fsync() on a file descriptor or specifying O_DIRECT | O_SYNC when opening a file, but this will degrade I/O performance. Dory avoids persisting data to disk with the view that these downsides outweigh the benefits of avoiding up to a few seconds of data loss when a machine crashes. However a feature I intend to add (described later in this email) will provide useful information about which messages were successfully delivered to Kafka in the event of a machine crash. Dory takes great care to avoid message loss due to circumstances other than machine crashes. When Dory starts up, it preallocates a fixed amount of memory (configurable with the --msg_buffer_max N command line option) for storing messages received from clients. Dory holds on to each message it receives until one of the following has occurred: - It receives an ACK from Kafka indicating that the message has been successfully received and persisted to disk. - It receives an error ACK from Kafka that causes the message to be immediately discarded. Certain error ACKs cause Dory to immediately discard messages, and others cause it to pause, update its metadata, and attempt redelivery based on the new metadata. ACK error 2 (invalid message, i.e. CRC error) causes Dory to immediately attempt redelivery without updating its metadata. The actions Dory takes for various error ACK values are documented here: https://github.com/dspeterson/dory/blob/master/doc/design.md#dispatcher - As mentioned above, certain error ACKs will cause Dory to attempt redelivery. Each time this occurs for a given message, Dory increments a failed delivery attempt count on the message. Once the count exceeds a certain value (specified by the --max_failed_delivery_attempts N command line option), Dory discards the message. - It has determined that the message is undeliverable due to a problem such as the message being malformed (see https://github.com/dspeterson/dory/blob/master/doc/sending_messages.md which documents Dory's input message format) or having an invalid topic. In this case, the message is discarded. - It has exhausted its preallocated memory for storing messages. This may occur in the case where a network-related outage or serious problem with the Kafka cluster persists for an extended period of time, preventing normal message delivery. In this case, new messages received from clients are discarded. - Dory receives a shutdown signal (SIGTERM or SIGINT). In this case, Dory immediately stops accepting new messages from clients, and for a certain configurable period of time attempts to deliver any pending messages. Once the time limit expires, any remaining messages are lost. This type of message loss can be avoided by not shutting Dory down until no more clients are sending it messages, and Dory has been given enough time to process all pending messages. A critical aspect of Dory's behavior is that every discard is tracked along with the reason why the discard occurred. Dory provides a web interface from which periodic discard reports and other status information can be obtained. Thus, in the rare event when discards occur, they can be tracked and recorded on a per-topic basis. The intent is for monitoring infrastructure to periodically ask Dory for a discard report, alert an administrator if message loss occurred, and optionally store the discard reports in a database so that a queryable history of data quality is available. A relevant feature I intend to add at some point is described here: http://dory.wikidot.com/wiki:commit-points This notion of a commit point can be useful in the event of a machine crash. If monitoring infrastructure periodically querys Dory for commit point information, then in the event of a machine crash, we know that any resulting message loss is limited to messages with timestamps > T, where T is the latest commit point. I hope this helps answer your questions, and that I haven't inundated you with too much information. Let me know if you have more questions. Regards, Dave -----Original Message----- From: "Arya Ketan" <ketan.a...@gmail.com> Sent: Sunday, June 12, 2016 11:52am To: users@kafka.apache.org Subject: Re: Introducing Dory Hi Dave, Dory looks pretty interesting. I had a few further questions on it a) How does Dory handle kernel panics? b) What kind of message guarantees does dory provide and also if you can share some design decisions taken to enable the guarantees whatever they are. Thanks Arya Arya On Sun, Jun 12, 2016 at 8:10 PM, Dave Peterson <d...@dspeterson.com> wrote: > Thanks! Enjoy :-) > > > > On 6/12/2016 12:24 AM, Gwen Shapira wrote: > >> Dory is pretty cool (even though it is named after a somewhat dorky >> fish). Thank you for sharing :) >> >> On Sun, Jun 12, 2016 at 1:24 AM, Dave Peterson <d...@dspeterson.com> >> wrote: >> >>> Hello Kafka users, >>> >>> Version 1.1.0 of Dory is now available. See >>> https://github.com/dspeterson/dory for details. Dory is the successor >>> to Bruce (https://github.com/tagged/bruce), a Kafka producer daemon I >>> created while working at if(we) (http://www.ifwe.co/). The code has >>> seen a number of improvements since its initial release in September >>> 2014. The list of example clients for various programming languages >>> has also been extended. Dory maintains full backward compatibility >>> with Bruce, so existing users can easily switch. >>> >>> The latest release adds support for receiving messages from clients by >>> UNIX domain stream socket or local TCP. Although UNIX domain >>> datagrams are still the preferred means of sending messages in most >>> cases, the option of using stream sockets facilitates sending messages >>> too large to fit in a single datagram. The local TCP option >>> facilitates adding support for clients written in programming >>> languages that do not provide easy access to UNIX domain sockets. >>> >>> Dory's wiki page http://dory.wikidot.com/start contains a list of >>> ideas for additional features and other improvements. Community >>> feedback is welcomed and appreciated. If you have ideas for things >>> you would like to see in future releases, please add them to the list. >>> Also, please contribute code if you can afford the time. >>> >>> >>> Thanks, >>> Dave Peterson >>> >>> >>> >