Re: [rsyslog] [RFC: Ingestion Relay] End-to-end reliable 'at-least-once' message delivery at large scale

Rainer Gerhards Fri, 23 Jan 2015 14:04:17 -0800

Well,  an output can block when forwarding. We do a similar thing with the
windows products, via another protocol. But as you say David, it's pretty
problematic in many cases.


But as I said, I really can't help develop this right now.  That includes
in depth design discussions.

Rainer

Sent from phone, thus brief.
Am 23.01.2015 22:54 schrieb "David Lang" <[email protected]>:

> The huge problem would be how to pass data (the ack) back up from the
> final receiver to the original sender. I don't think rsyslog would be able
> to do this without a MAJOR rewrite (currently the earlier machines would
> have completely forgotten about the message by the time the ack gets
> generated)
>
> David Lang
>
> On Fri, 23 Jan 2015, Rainer Gerhards wrote:
>
>  Sorry to be that blunt, but I simply have no time to participate in
>> developing this. But I would be very open to merge any results.
>>
>> Rainer
>>
>> Sent from phone, thus brief.
>> Am 23.01.2015 22:17 schrieb "singh.janmejay" <[email protected]>:
>>
>>  On Sat, Jan 24, 2015 at 2:19 AM, David Lang <[email protected]> wrote:
>>>
>>> > RELP is the network protocol you need for this sort of reliability.
>>> > However, you would also need to not allow any message to be stored in
>>> > memory (because it would be lost if rsyslog crashes or the system
>>> reboots
>>> > unexpectedly). You would have to use disk queues (not disk assisted
>>> queues)
>>> > everywhere and do some other settings (checkpoint interval of 1 for
>>> example)
>>> >
>>> > This would absolutly cripple your performance due to the disk I/O
>>> > limitations. I did some testing of this a few years ago. I was using a
>>> > high-end PCI SSD (a 160G card cost >$5K at the time) and depending on
>>> the
>>> > filesystem I used, I could get rsyslog to receive between 2K and 8K
>>> > messages/sec. The same hardware writing to a 7200rpm SATA drive with
>>> memory
>>> > buffering allowed could handle 380K messages/sec (the limiting factor
>>> was
>>> > the Gig-E network)
>>> >
>>> > Doing this sort of reliability on a 15Krpm SAS drive would limit you to
>>> > ~50 logs/sec. Modern SSDs would be able to do better, I would guess a
>>> few
>>> > hundred logs/sec from a good drive, but you would be chewing through
>>> the
>>> > drive lifetime several thousand times faster than if you were allowing
>>> > memory buffering.
>>> >
>>> > Very few people have logs that are critical enough to warrent this sort
>>> of
>>> > performance degredation.
>>> >
>>>
>>> I didn't particularly have disk-based queues in mind for reliability
>>> reasons. However, messages may need to overflow to disk to manage bursts
>>> (but only for burstability reasons). For a large-architecture for this
>>> nature, its generally useful to classify failures in a broad way (rather
>>> than very granular failure modes, that we identify for transactional
>>> databases etc). The reason for this ties back to self-healing. Its easier
>>> to build self-healing mechanisms assuming only one kind of failure, node
>>> loss. It could happen for multiple reasons, but if we treat it that way,
>>> all we have to do is build room for managing the cluster when 1 (or k)
>>> nodes are lost.
>>>
>>> So thinking of it that way, a rsyslog crash, or a machine-crash or a disk
>>> failure are all the same to me. They are just node loss (we may be able
>>> to
>>> bring the node back with some offline procedure), but it'll come back as
>>> a
>>> fresh machine with no state.
>>>
>>> Which is why I treat K-safety as a basic design parameter. If K nodes
>>> disappear, data will be lost.
>>>
>>> With this kind of coarse-grained failure-mode, messages can easily be
>>> kept
>>> in memory.
>>>
>>>
>>> >
>>> > In addition, this sort of reliability is saying that you would rather
>>> have
>>> > your applications freeze than have them do something and not have it
>>> > logged. And that you are willing to have your application slow down to
>>> the
>>> > speed of the logging. Very few people are willing to do this.
>>> >
>>> >
>>> >
>>> > You are proposing doing the application ack across multiple hops
>>> instead
>>> > of doing it hop-by-hop. This would avoid the problem that can happen
>>> with
>>> > hop-by-hop acks where a machine that has acked a message then dies and
>>> > needs to be recovered before the message can get delivered (assuming
>>> you
>>> > have redundant storage and enough of the storage survives to be able to
>>> be
>>> > read, the message would eventually get through).
>>> >
>>> > But you now have the problem that the sender needs to know how many
>>> > destinations the logs are going to. If you have any filters to decide
>>> what
>>> > to do with the logs, the sender needs to know if the log got lost, or
>>> if
>>> a
>>> > filter decided to not write the log. If the rules would deliver the
>>> logs
>>> to
>>> > multiple places, the sender will need to know how many places it's
>>> going
>>> to
>>> > be delivered to so that it can know how many different acks it's
>>> supposed
>>> > to get back.
>>> >
>>>
>>> So the design expects clusters to be broken in multiple tiers. Let us
>>> take
>>> a 3 tier example.
>>>
>>> Say we have 100 machines, we break them into 3 tiers of 34, 33 and 33
>>> machines.
>>>
>>> Assuming every producer wants at the most 2-safety, I can use 3 tiers to
>>> build this design.
>>>
>>> So first producer discovers Tier-1 nodes, and hashes its session_id to
>>> pick
>>> one of the 34 nodes, if it is not able to connect, it discards that node
>>> from the collection(now we end up with 33 nodes in Tier-1) and hashes to
>>> a
>>> different node (again a Tier-1 node, of-course).
>>>
>>> One it finds a node that it can connect to, it sends its session_id and
>>> message-batch to it.
>>>
>>> The selected Tier-1 node now hashes the session_id and finds a Tier-2
>>> node
>>> (it again discovers all Tier-2 nodes via external discovery mechanism).
>>> If
>>> it fails to connect, it discards that node and hashes again to one of the
>>> remaining 32 nodes, and so on.
>>>
>>> Eventually it reaches Tier-3, which is where ruleset has a clause which
>>> checks for replica_number == 1, and handles the message differently. It
>>> is
>>> handed over to an action which delivers it to the downstream system
>>> (which
>>> may in-turn again be a syslog-cluster, or a datastore etc).
>>>
>>> So each node only has to worry about the next hop that it needs to
>>> deliver
>>> to.
>>>
>>>
>>> >
>>> > These problems make it so that I don't see how you would reasonably
>>> manage
>>> > this sort of environment.
>>> >
>>> >
>>> >
>>> > I would suggest that you think hard about what your requirements really
>>> > are.
>>> >
>>> > It may be that you are only sending to one place, in which case, you
>>> > really want to just be inserting your messages into an ACID complient
>>> > database.
>>> >
>>> > It may be that your requirements for absolute reliability are not quite
>>> as
>>> > severe as you are initially thinking that they are, and that you can
>>> then
>>> > use the existing hop-by-hop reliability. Or they are even less severe
>>> and
>>> > you can accept some amount of memory buffering to get a few orders of
>>> > magnatude better performance from your logging. Remember that we are
>>> > talking about performance differences of 10,000x on normal hardware. A
>>> bit
>>> > less, but still 100x or so on esoteric, high-end hardware.
>>> >
>>> >
>>> Yep, I completely agree. In most cases extreme reliability such as this
>>> is
>>> not required, and is best avoided for cost reasons.
>>>
>>> But for select applications it is lifesaver.
>>>
>>>
>>> >
>>> > I will also say that there are messaging systems that claim to have the
>>> > properties that you are looking for (Flume for example), but almost
>>> nobody
>>> > operates them in their full reliability mode because of the performance
>>> > issues. And they do not have the filtering and multiple destination
>>> > capabilities that *syslog provides.
>>> >
>>>
>>> Yes, Flume is one of the best options. But it comes with some unique
>>> problems too (its not light-weight enough for running producer side +
>>> managed-environment overhead (GC etc) cause their own set of problems).
>>> There is also value in offering the same interface to producers for
>>> ingestion into un-acked and reliable pipeline (because a lot of other
>>> things, like integration with other systems can be reused). It also keeps
>>> things simple because producers do all operations in one way, with one
>>> tool, regardless of its ingestion mechanism being acked/replicated etc.
>>>
>>> Reliability in this case is built end-to-end, so building stronger
>>> guarantees over-the-wire parts of the pipeline doesn't seem very valuable
>>> to me. Why do you feel RELP will be necessary?
>>>
>>>
>>> >
>>> > David Lang
>>> >
>>> >
>>> >
>>> > On Sat, 24 Jan 2015, singh.janmejay wrote:
>>> >
>>> >  Date: Sat, 24 Jan 2015 01:48:18 +0530
>>> >> From: singh.janmejay <[email protected]>
>>> >> Reply-To: rsyslog-users <[email protected]>
>>> >> To: rsyslog-users <[email protected]>
>>> >> Subject: [rsyslog] [RFC: Ingestion Relay] End-to-end reliable
>>> >> 'at-least-once'
>>> >>     message delivery at large scale
>>> >>
>>> >>
>>> >> Greetings,
>>> >>
>>> >> This is a proposal for new-feature, and im inviting thoughts.
>>> >>
>>> >> The aim is to use a set of rsyslog nodes(let us call it a cluster) to
>>> be
>>> >> able to move messages reliably from source to destination.
>>> >>
>>> >> Let us make a few assumptions so we can define the expected properties
>>> >> clearly.
>>> >>
>>> >>
>>> >> Assumptions:
>>> >>
>>> >> - Data once successfully delivered to the Destination (typically a
>>> >> datastore) is considered safe.
>>> >> - Source-crashing with incomplete message hand-off to the cluster is
>>> >> outside the scope of this. In such a case, source must retry.
>>> >> - The cluster must be designed to support a maximum of K node failures
>>> >> without any message loss
>>> >>
>>> >>
>>> >> Here are the properties that may be desirable in such a service(the
>>> >> cluster
>>> >> is implementation of this service):
>>> >>
>>> >> - No message should ever be lost once handed over to the
>>> delivery-network
>>> >> except in a disaster scenario
>>> >> - Disaster scenario is a condition where more than k nodes in the
>>> cluster
>>> >> fail
>>> >> - Each source may pick a desirable value of k, where (k <= K)
>>> >> - Any cluster nodes must re-transmit messages at a timeout T, if
>>> >> downstream
>>> >> fails to ACK it before the timeout.
>>> >> - Such a cluster should ideally be composable, in the sense, user
>>> should
>>> >> be
>>> >> able to chain multiple such clusters.
>>> >>
>>> >>
>>> >> This requires the cluster to support k-way replication of messages in
>>> the
>>> >> cluster.
>>> >>
>>> >> Implementation:
>>> >>
>>> >> High level:
>>> >> - The cluster is divided in multiple tiers (let us call them
>>> >> replication-tiers (or rep-tiers).
>>> >> - The cluster can handle multiple sessions at a time.
>>> >> - Session_ids are unique and are generated by producer system when
>>> they
>>> >> start producing messages
>>> >> - Within a session, we have a notion of sequence-number (or seq_no),
>>> which
>>> >> is a monotonically increasing number(incremented by 1 per message).
>>> This
>>> >> requirement can possibly be relaxed for performance reasons, and gaps
>>> in
>>> >> seq-id may be acceptable.
>>> >> - Replication is basically managed by lower tiers sending data over to
>>> >> higher tiers within the cluster, until replica-number (an attribute
>>> each
>>> >> message carries, falls to 1)
>>> >> - When replica-number falls to zero, we transmit message to desired
>>> >> destination. (This can alternatively be done at the earliest
>>> opportunity,
>>> >> i.e. in Tier-1, under special-circumstances, but let us discuss that
>>> later
>>> >> if we find enough interest in doing so).
>>> >> - There must be several nodes in each Tier, allocated to minimize
>>> >> possibility of all of them going down at once (across availability
>>> zones,
>>> >> different chassis etc).
>>> >> - There must be a mechanism which allows nodes from upstream system to
>>> >> discover nodes of Tier-1 of the cluster, and Tier-1 nodes to discover
>>> >> nodes
>>> >> in Tier-2 of the cluster and so on. Hence nodes in Tier-K of the
>>> cluster
>>> >> should be able to discover downstream nodes.
>>> >> - Each session (or multiple sessions bundled according to arbitrary
>>> logic,
>>> >> such as hashing), must pick one node from each tier as
>>> >> downstream-tier-node.
>>> >> - Each node must maintain 2 watermarks:
>>> >>    * Replicated till seq_no : till what sequence number have messages
>>> been
>>> >> k-way replicated in the cluster
>>> >>    * Delivered till seq_no: till what sequence number have messages
>>> been
>>> >> delivered to downstream system
>>> >> - Each send-operation (i.e. transmission of messages) from upstream to
>>> >> cluster's Tier-1 or from lower tier in cluster to higher tier in
>>> cluster
>>> >> will pass messages such that highest seq_no of any message(per
>>> session)
>>> in
>>> >> transmitted batch is known
>>> >> - Each receive-operation in cluster's Tier-1 or in upper-tiers within
>>> >> cluster must respond/reply to transmitter with the two water-mark
>>> values
>>> >> (i.e Replicated seq_no and Delivered seq_no) per session.
>>> >> - Lower tiers (within the cluster) are free to discard messages all
>>> >> message
>>> >> with seq_no <= Delivered till seq_no
>>> >> - Upstream system is free to discard all messages with seq_no <=
>>> >> Replicated
>>> >> till seq_no of cluster
>>> >> - Upstream and downstream systems can be chained as instances of such
>>> >> clusters if need be
>>> >> - Maximum replication factor 'K' is dictated by cluster design (number
>>> of
>>> >> tiers)
>>> >> - Desired replication factor 'k' is a per-message controllable
>>> attribute
>>> >> (decided by the upstream)
>>> >>
>>> >> The sequence-diagrams below explain this visually:
>>> >>
>>> >> Here is a case with an upstream sending messages with k = K :
>>> >> ingestion_relay_1_max_replication.png
>>> >> <https://docs.google.com/file/d/0B_XhUZLNFT4dN21TLTZBQjZMdUk/
>>> >> edit?usp=drive_web>
>>> >>
>>> >> This is a case with k < K :
>>> >> ingestion_relay_2_low_replication.png
>>> >> <https://docs.google.com/file/d/0B_XhUZLNFT4da1lKMnRKdU9JUkU/
>>> >> edit?usp=drive_web>
>>> >> 
>>> >> The above 2 cases show only one transmission going from upstream
>>> system
>>> to
>>> >> downstream system serially, this shows it pipelined :
>>> >> ingestion_relay_3_pipelining.png
>>> >> <https://docs.google.com/file/d/0B_XhUZLNFT4dQUpTZGRDdVVXLVU/
>>> >> edit?usp=drive_web>
>>> >> 
>>> >> This demonstrates failure of a node in the cluster, and how it
>>> recovers
>>> in
>>> >> absence of continued transmission (it is recovered by timeout and
>>> >> retransmission) :
>>> >> ingestion_relay_4_timeout_based_recovery.png
>>> >> <https://docs.google.com/file/d/0B_XhUZLNFT4dMm5kUWtaTlVfV1U/
>>> >> edit?usp=drive_web>
>>> >> 
>>> >> This demonstrates failure of a node in the cluster, and how it
>>> recovers
>>> >> due
>>> >> to continued transmission :
>>> >> ingestion_relay_5_broken_transmission_based_recovery.png
>>> >> <https://docs.google.com/file/d/0B_XhUZLNFT4dd3M0SXpUYjFXdlk/
>>> >> edit?usp=drive_web>
>>> >>
>>> >> 
>>> >>
>>> >> Rsyslog level implementation sketch:
>>> >>
>>> >> - Let us assume there is a way to identify the set of inputs, queues,
>>> >> rulesets and actions that need to participate as reliable pipeline
>>> >> components in a cluster node
>>> >> - Each participating queue, will expect messages to contain a
>>> session-id
>>> >> - Consumer bound to a queue will be expected to provide values for
>>> both
>>> >> watermarks to per-session to dequeue more messages.
>>> >> - Producer bound to a queue will be provided values for both
>>> watermarks
>>> >> per-session as return value when en-queueing more messages.
>>> >> - The inputs will transmit (either broadcast or unicast) both
>>> watermark
>>> >> values to upstream actions (unicast is sent over relevant connections,
>>> >> broadcast is sent across all connections) (please note this has
>>> nothing
>>> to
>>> >> do with network broadcast domains, as everything is over TCP).
>>> >> - Actions will receive the two watermarks and push it back to the
>>> queue
>>> >> action is bound to, in order to dequeue more messages
>>> >> - Rulesets will need to pick the relevant actions value across
>>> multiple
>>> >> action-queues according to user-provided configuration, and propagate
>>> it
>>> >> backwards
>>> >> - Action must have ability to set arbitrarily value for replica-number
>>> >> when
>>> >> passing it to downstream-system (so that chaining is possible).
>>> >> - Inputs may produce the new value for replicated till seq_no when
>>> >> receiving a message with replica_number == 1
>>> >> - Action may produce the new value for delivered till seq_no after
>>> having
>>> >> successfully delivered a message with replica_number == 1
>>> >>
>>> >> Rsyslog configuration required(from user):
>>> >>
>>> >> - User will need to identify machines that are a part of cluster
>>> >> - These machines will have to be divided in multiple replication tiers
>>> (as
>>> >> replication will happen only across machines in different tiers)
>>> >> - User can pass message to the next cluster by setting replica_number
>>> back
>>> >> to a desired number and passing it to an action which writes it to one
>>> of
>>> >> the nodes in a downstream cluster
>>> >> - User needs to check replica_number in the ruleset and take special
>>> >> action
>>> >> (to write it to downstream system) when replica_number == 1
>>> >>
>>> >>
>>> >> Does this have any overlap with RELP?
>>> >>
>>> >> I haven't studied RELP in depth yet, but as far as I understand it, it
>>> >> tries to solve the problem of delivering messages reliably between a
>>> >> single-producer and a single-consumer losslessly (it targets different
>>> >> kind
>>> >> of loss scenarios specifically). In addition to this, its scope is
>>> limited
>>> >> to ensuring no messages are lost during transportation. In event of a
>>> >> crash
>>> >> of the receiver node before it can handle received message reliably,
>>> some
>>> >> messages may be lost. Someone with deeper knowledge of RELP should
>>> chime
>>> >> in.
>>> >>
>>> >>
>>> >>
>>> >> Thoughts?
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Regards,
>>> >> Janmejay
>>> >> http://codehunk.wordpress.com
>>> >> _______________________________________________
>>> >> rsyslog mailing list
>>> >> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>> >> http://www.rsyslog.com/professional-services/
>>> >> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>> >> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>> myriad
>>> >> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>>> >> DON'T LIKE THAT.
>>> >
>>> >
>>> > _______________________________________________
>>> > rsyslog mailing list
>>> > http://lists.adiscon.net/mailman/listinfo/rsyslog
>>> > http://www.rsyslog.com/professional-services/
>>> > What's up with rsyslog? Follow https://twitter.com/rgerhards
>>> > NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>> myriad
>>> > of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>>> > DON'T LIKE THAT.
>>> >
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Janmejay
>>> http://codehunk.wordpress.com
>>> _______________________________________________
>>> rsyslog mailing list
>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>> http://www.rsyslog.com/professional-services/
>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>>> DON'T LIKE THAT.
>>>
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com/professional-services/
>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>> DON'T LIKE THAT.
>
>
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com/professional-services/
> What's up with rsyslog? Follow https://twitter.com/rgerhards
> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
> DON'T LIKE THAT.
>
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Re: [rsyslog] [RFC: Ingestion Relay] End-to-end reliable 'at-least-once' message delivery at large scale

Reply via email to