Re: [rsyslog] omelasticsearch - failed operation handling
maybe the actual code will explain what I intend: https://github.com/rsyslog/rsyslog/pull/2733

On 05/18/2018 10:52 AM, Rainer Gerhards wrote:

Just quickly chiming in, will need to catch a plane early tomorrow morning. It's complicated. At this point, the original message is no longer available, as omelasticsearch works with batches, but the rule engine needs to process message by message (we had to change that some time ago). The messages are still in batches, but modifications happen to the messages, so they need to go through individually. Needs more explanation, for which I currently have no time. So we need to either create a new rsyslog core-to-plugin interface or do something omelasticsearch-specific. I can elaborate more at the end of May.

Rainer
Sent from phone, thus brief.

David Lang wrote on Thu, 17 May 2018, 18:25:

On Thu, 17 May 2018, Rich Megginson wrote:

>> then you can mark the ones accepted as done and just retry the ones that fail.
>
> That's what I'm proposing.
>
>> But there's still no need for a separate ruleset and queue. In Rsyslog, if an output cannot accept a message and there's reason to think that it will in the future, then you suspend that output and try again later. If you have reason to believe that the message is never going to be able to be delivered, then you need to fail the message or you will be stuck forever. This is what the error output was made for.
>
> So how would that work on a per-record basis?
>
> Would this be something different than using MsgConstruct -> set fields in msg from original request -> ratelimitAddMsg for each record to resubmit?

Rainer, in a batch, is there any way to mark some of the messages as delivered and others as failed, as opposed to failing the entire batch?

>>> If using the "index" (default) bulk type, this causes duplicate records to be added.
>>> If using the "create" type (and you have assigned a unique _id), you will get back many 409 Duplicate errors.
>>> This causes problems - we know because this is how the fluentd plugin used to work, which is why we had to change it.
>>>
>>> https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
>>> "Bulk Rejections"
>>> "It is much better to handle queuing in your application by gracefully handling the back pressure from a full queue. When you receive bulk rejections, you should take these steps:
>>>
>>> Pause the import thread for 3–5 seconds.
>>> Extract the rejected actions from the bulk response, since it is probable that many of the actions were successful. The bulk response will tell you which succeeded and which were rejected.
>>> Send a new bulk request with just the rejected actions.
>>> Repeat from step 1 if rejections are encountered again.
>>>
>>> Using this procedure, your code naturally adapts to the load of your cluster and naturally backs off."
>>
>> Does it really accept some and reject some in a random manner? Or is it a matter of accepting the first X and rejecting any after that point? The first is easier to deal with.
>
> It appears to be random. So you may get a failure from the first record in the batch and the last record in the batch, and success for the others. Or vice versa. There appear to be many, many factors in the tuning, hardware, network, etc. that come into play.
>
> There isn't an easy way to deal with this :P
>
>> Batch mode was created to be able to more efficiently process messages that are inserted into databases; we then found that the reduced queue congestion was a significant advantage in itself.
>>
>> But unless you have a queue just for the ES action,
>
> That's what we had to do for the fluentd case - we have a separate "ES retry queue". One of the tricky parts is that there may be multiple outputs - you may want to send each log record to Elasticsearch _and_ a message bus _and_ a remote rsyslog forwarder. But you only want to retry sending to Elasticsearch, to avoid duplication in the other outputs.

In Rsyslog, queues are explicitly configured by the admin (for various reasons, including performance and reliability trade-offs); I really don't like the idea of omelasticsearch creating its own queue without these options. Kafka does this and it's an ongoing source of problems.

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
Re: [rsyslog] omelasticsearch - failed operation handling
Just quickly chiming in, will need to catch a plane early tomorrow morning. It's complicated. At this point, the original message is no longer available, as omelasticsearch works with batches, but the rule engine needs to process message by message (we had to change that some time ago). The messages are still in batches, but modifications happen to the messages, so they need to go through individually. Needs more explanation, for which I currently have no time. So we need to either create a new rsyslog core-to-plugin interface or do something omelasticsearch-specific. I can elaborate more at the end of May.

Rainer
Sent from phone, thus brief.

David Lang wrote on Thu, 17 May 2018, 18:25:

> On Thu, 17 May 2018, Rich Megginson wrote:
>
>>> then you can mark the ones accepted as done and just retry the ones that fail.
>>
>> That's what I'm proposing.
>>
>>> But there's still no need for a separate ruleset and queue. In Rsyslog, if an output cannot accept a message and there's reason to think that it will in the future, then you suspend that output and try again later. If you have reason to believe that the message is never going to be able to be delivered, then you need to fail the message or you will be stuck forever. This is what the error output was made for.
>>
>> So how would that work on a per-record basis?
>>
>> Would this be something different than using MsgConstruct -> set fields in msg from original request -> ratelimitAddMsg for each record to resubmit?
>
> Rainer, in a batch, is there any way to mark some of the messages as delivered and others as failed, as opposed to failing the entire batch?
>
>>>> If using the "index" (default) bulk type, this causes duplicate records to be added.
>>>> If using the "create" type (and you have assigned a unique _id), you will get back many 409 Duplicate errors.
>>>> This causes problems - we know because this is how the fluentd plugin used to work, which is why we had to change it.
>>>>
>>>> https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
>>>> "Bulk Rejections"
>>>> "It is much better to handle queuing in your application by gracefully handling the back pressure from a full queue. When you receive bulk rejections, you should take these steps:
>>>>
>>>> Pause the import thread for 3–5 seconds.
>>>> Extract the rejected actions from the bulk response, since it is probable that many of the actions were successful. The bulk response will tell you which succeeded and which were rejected.
>>>> Send a new bulk request with just the rejected actions.
>>>> Repeat from step 1 if rejections are encountered again.
>>>>
>>>> Using this procedure, your code naturally adapts to the load of your cluster and naturally backs off."
>>>
>>> Does it really accept some and reject some in a random manner? Or is it a matter of accepting the first X and rejecting any after that point? The first is easier to deal with.
>>
>> It appears to be random. So you may get a failure from the first record in the batch and the last record in the batch, and success for the others. Or vice versa. There appear to be many, many factors in the tuning, hardware, network, etc. that come into play.
>>
>> There isn't an easy way to deal with this :P
>>
>>> Batch mode was created to be able to more efficiently process messages that are inserted into databases; we then found that the reduced queue congestion was a significant advantage in itself.
>>>
>>> But unless you have a queue just for the ES action,
>>
>> That's what we had to do for the fluentd case - we have a separate "ES retry queue". One of the tricky parts is that there may be multiple outputs - you may want to send each log record to Elasticsearch _and_ a message bus _and_ a remote rsyslog forwarder. But you only want to retry sending to Elasticsearch, to avoid duplication in the other outputs.
>
> In Rsyslog, queues are explicitly configured by the admin (for various reasons, including performance and reliability trade-offs); I really don't like the idea of omelasticsearch creating its own queue without these options. Kafka does this and it's an ongoing source of problems.
Re: [rsyslog] omelasticsearch - failed operation handling
On Thu, 17 May 2018, Rich Megginson wrote:

>> then you can mark the ones accepted as done and just retry the ones that fail.
>
> That's what I'm proposing.
>
>> But there's still no need for a separate ruleset and queue. In Rsyslog, if an output cannot accept a message and there's reason to think that it will in the future, then you suspend that output and try again later. If you have reason to believe that the message is never going to be able to be delivered, then you need to fail the message or you will be stuck forever. This is what the error output was made for.
>
> So how would that work on a per-record basis?
>
> Would this be something different than using MsgConstruct -> set fields in msg from original request -> ratelimitAddMsg for each record to resubmit?

Rainer, in a batch, is there any way to mark some of the messages as delivered and others as failed, as opposed to failing the entire batch?

>>> If using the "index" (default) bulk type, this causes duplicate records to be added.
>>> If using the "create" type (and you have assigned a unique _id), you will get back many 409 Duplicate errors.
>>> This causes problems - we know because this is how the fluentd plugin used to work, which is why we had to change it.
>>>
>>> https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
>>> "Bulk Rejections"
>>> "It is much better to handle queuing in your application by gracefully handling the back pressure from a full queue. When you receive bulk rejections, you should take these steps:
>>>
>>> Pause the import thread for 3–5 seconds.
>>> Extract the rejected actions from the bulk response, since it is probable that many of the actions were successful. The bulk response will tell you which succeeded and which were rejected.
>>> Send a new bulk request with just the rejected actions.
>>> Repeat from step 1 if rejections are encountered again.
>>>
>>> Using this procedure, your code naturally adapts to the load of your cluster and naturally backs off."
>>
>> Does it really accept some and reject some in a random manner? Or is it a matter of accepting the first X and rejecting any after that point? The first is easier to deal with.
>
> It appears to be random. So you may get a failure from the first record in the batch and the last record in the batch, and success for the others. Or vice versa. There appear to be many, many factors in the tuning, hardware, network, etc. that come into play.
>
> There isn't an easy way to deal with this :P
>
>> Batch mode was created to be able to more efficiently process messages that are inserted into databases; we then found that the reduced queue congestion was a significant advantage in itself.
>>
>> But unless you have a queue just for the ES action,
>
> That's what we had to do for the fluentd case - we have a separate "ES retry queue". One of the tricky parts is that there may be multiple outputs - you may want to send each log record to Elasticsearch _and_ a message bus _and_ a remote rsyslog forwarder. But you only want to retry sending to Elasticsearch, to avoid duplication in the other outputs.

In Rsyslog, queues are explicitly configured by the admin (for various reasons, including performance and reliability trade-offs); I really don't like the idea of omelasticsearch creating its own queue without these options. Kafka does this and it's an ongoing source of problems.
Re: [rsyslog] omelasticsearch - failed operation handling
On 05/17/2018 05:52 AM, Brian Knox wrote:
> To my knowledge, Rich is correct. This also would explain a case we hit maybe every couple of months, where rsyslog very quickly duplicates some messages it is sending to elasticsearch. I would assume this would be a case where a batch is submitted, only some of the messages are rejected, and rsyslog then duplicates messages trying to send the batch over and over again.

You can confirm this by monitoring the bulk index thread pool https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-thread-pool.html to see if you are getting bulk rejections.

> On Thu, May 17, 2018 at 12:08 AM David Lang wrote:
>> On Wed, 16 May 2018, Rich Megginson wrote:
>>
>>> On 05/16/2018 05:58 PM, David Lang wrote:
>>>> there's no need to add this extra complexity (multiple rulesets and queues)
>>>>
>>>> What should be happening (on any output module) is:
>>>>
>>>> submit a batch.
>>>>   If rejected with a soft error, retry/suspend the output
>>>
>>> retry of the entire batch? see below
>>>
>>>> if batch-size=1 and a hard error, send to errorfile
>>>>   if rejected with a hard error resubmit half of the batch
>>>
>>> But what if 90% of the batch was successfully added? Then you are needlessly resubmitting many of the records in the batch.
>>
>> when submitting batches, you get a success/fail for the batch as a whole (for 99% of things that actually allow you to insert in batches), so you don't know what message failed. This is a database transaction (again, in most cases), so if a batch fails, all you can do is bisect to figure out what message fails. If the endpoint is inserting some of the messages from a batch that fails, that's usually a bad thing.
>>
>> now, if ES batch mode isn't an ACID transaction and it accepts some messages and then tells you which ones failed, then you can mark the ones accepted as done and just retry the ones that fail. But there's still no need for a separate ruleset and queue. In Rsyslog, if an output cannot accept a message and there's reason to think that it will in the future, then you suspend that output and try again later. If you have reason to believe that the message is never going to be able to be delivered, then you need to fail the message or you will be stuck forever. This is what the error output was made for.
>>
>>> If using the "index" (default) bulk type, this causes duplicate records to be added.
>>> If using the "create" type (and you have assigned a unique _id), you will get back many 409 Duplicate errors.
>>> This causes problems - we know because this is how the fluentd plugin used to work, which is why we had to change it.
>>>
>>> https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
>>> "Bulk Rejections"
>>> "It is much better to handle queuing in your application by gracefully handling the back pressure from a full queue. When you receive bulk rejections, you should take these steps:
>>>
>>> Pause the import thread for 3–5 seconds.
>>> Extract the rejected actions from the bulk response, since it is probable that many of the actions were successful. The bulk response will tell you which succeeded and which were rejected.
>>> Send a new bulk request with just the rejected actions.
>>> Repeat from step 1 if rejections are encountered again.
>>>
>>> Using this procedure, your code naturally adapts to the load of your cluster and naturally backs off."
>>
>> Does it really accept some and reject some in a random manner? Or is it a matter of accepting the first X and rejecting any after that point? The first is easier to deal with.
>>
>> Batch mode was created to be able to more efficiently process messages that are inserted into databases; we then found that the reduced queue congestion was a significant advantage in itself.
>>
>> But unless you have a queue just for the ES action, doing queue manipulation isn't possible; all you can do is succeed or fail, and if you fail, the retry logic will kick in.
>>
>> Rainer is going to need to comment on this.
>>
>> David Lang
>>
>>>> repeat
>>>>
>>>> all that should be needed is to add tests into omelasticsearch to detect the soft errors and turn them into retries (or suspend the output as appropriate)
>>>>
>>>> David Lang
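The bisect strategy described above (for endpoints that only report pass/fail for the whole batch) can be sketched as follows. This is only an illustration of the idea from the discussion, not rsyslog code; `send` and `on_failed` are hypothetical stand-ins for the output's submit call and its error handling.

```python
# Sketch: isolate failing records by splitting a rejected batch in half
# and recursing, since the endpoint gives no per-record detail.
def bisect_submit(batch, send, on_failed):
    if not batch:
        return
    if send(batch):
        return  # the whole (sub-)batch was accepted
    if len(batch) == 1:
        on_failed(batch[0])  # a single record that hard-fails
        return
    mid = len(batch) // 2
    bisect_submit(batch[:mid], send, on_failed)
    bisect_submit(batch[mid:], send, on_failed)

# Example: a fake endpoint that rejects any batch containing "bad".
failed = []
bisect_submit(["a", "bad", "b"], lambda b: "bad" not in b, failed.append)
print(failed)  # → ['bad']
```

Note this resubmits the successful halves too, which is exactly the duplication problem raised in the thread when the endpoint has already inserted part of a failed batch.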
Re: [rsyslog] omelasticsearch - failed operation handling
On 05/16/2018 10:08 PM, David Lang wrote:
> On Wed, 16 May 2018, Rich Megginson wrote:
>> On 05/16/2018 05:58 PM, David Lang wrote:
>>> there's no need to add this extra complexity (multiple rulesets and queues)
>>>
>>> What should be happening (on any output module) is:
>>>
>>> submit a batch.
>>>   If rejected with a soft error, retry/suspend the output
>>
>> retry of the entire batch? see below
>>
>>> if batch-size=1 and a hard error, send to errorfile
>>>   if rejected with a hard error resubmit half of the batch
>>
>> But what if 90% of the batch was successfully added? Then you are needlessly resubmitting many of the records in the batch.
>
> when submitting batches, you get a success/fail for the batch as a whole (for 99% of things that actually allow you to insert in batches),

For Elasticsearch - yes, there is a top level "errors" field in the response with a boolean value. false means all records in the batch were successfully processed; true means _at least one_ record in the batch was not processed successfully. For example, you will get a response of "errors": true even if all but one of the records in the batch were successfully processed.

> so you don't know what message failed.

You do know exactly which record failed, and in most cases what the error was. Here is an example from the fluent-plugin-elasticsearch unit test: https://github.com/uken/fluent-plugin-elasticsearch/blob/master/test/plugin/test_elasticsearch_error_handler.rb#L88

This is what the response looks like coming from Elasticsearch. You get a separate response item for every record submitted in the bulk request. In addition, you are guaranteed that the order of the items in the response is exactly the same as the order of the items submitted in the bulk request, so that you can exactly correlate the request object with the response.

> This is a database transaction (again, in most cases),

Not in Elasticsearch at the bulk index level. Probably at the very low level where lucene hits the disk.

> so if a batch fails, all you can do is bisect to figure out what message fails. If the endpoint is inserting some of the messages from a batch that fails, that's usually a bad thing.
>
> now, if ES batch mode isn't an ACID transaction and it accepts some messages and then tells you which ones failed,

It does

> then you can mark the ones accepted as done and just retry the ones that fail.

That's what I'm proposing.

> But there's still no need for a separate ruleset and queue. In Rsyslog, if an output cannot accept a message and there's reason to think that it will in the future, then you suspend that output and try again later. If you have reason to believe that the message is never going to be able to be delivered, then you need to fail the message or you will be stuck forever. This is what the error output was made for.

So how would that work on a per-record basis?

Would this be something different than using MsgConstruct -> set fields in msg from original request -> ratelimitAddMsg for each record to resubmit?

If using the "index" (default) bulk type, this causes duplicate records to be added.
If using the "create" type (and you have assigned a unique _id), you will get back many 409 Duplicate errors.
This causes problems - we know because this is how the fluentd plugin used to work, which is why we had to change it.

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
"Bulk Rejections"
"It is much better to handle queuing in your application by gracefully handling the back pressure from a full queue. When you receive bulk rejections, you should take these steps:

Pause the import thread for 3–5 seconds.
Extract the rejected actions from the bulk response, since it is probable that many of the actions were successful. The bulk response will tell you which succeeded and which were rejected.
Send a new bulk request with just the rejected actions.
Repeat from step 1 if rejections are encountered again.

Using this procedure, your code naturally adapts to the load of your cluster and naturally backs off."

> Does it really accept some and reject some in a random manner? Or is it a matter of accepting the first X and rejecting any after that point? The first is easier to deal with.

It appears to be random. So you may get a failure from the first record in the batch and the last record in the batch, and success for the others. Or vice versa. There appear to be many, many factors in the tuning, hardware, network, etc. that come into play.

There isn't an easy way to deal with this :P

> Batch mode was created to be able to more efficiently process messages that are inserted into databases; we then found that the reduced queue congestion was a significant advantage in itself.
>
> But unless you have a queue just for the ES action,

That's what we had to do for the fluentd case - we have a separate "ES retry
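The per-item correlation Rich describes (the response's "items" array is in the same order as the submitted requests) could be sketched like this. The "errors"/"items"/"status" fields follow the Elasticsearch bulk API; the sample request/response data is made up for illustration.

```python
# Sketch: pair each Elasticsearch bulk-response item with the request
# that produced it, relying on the documented ordering guarantee.
def failed_records(requests, response):
    if not response.get("errors"):
        return []  # every record in the batch was processed
    failures = []
    for req, item in zip(requests, response["items"]):
        # each item is shaped like {"index": {...}} or {"create": {...}}
        result = next(iter(item.values()))
        if result["status"] not in (200, 201):
            failures.append((req, result["status"]))
    return failures

requests = [{"msg": "first"}, {"msg": "second"}]
response = {
    "errors": True,
    "items": [
        {"create": {"status": 201}},
        {"create": {"status": 429,
                    "error": {"reason": "es_rejected_execution_exception"}}},
    ],
}
print(failed_records(requests, response))  # → [({'msg': 'second'}, 429)]
```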
Re: [rsyslog] omelasticsearch - failed operation handling
To my knowledge, Rich is correct. This also would explain a case we hit maybe every couple of months, where rsyslog very quickly duplicates some messages it is sending to elasticsearch. I would assume this would be a case where a batch is submitted, only some of the messages are rejected, and rsyslog then duplicates messages trying to send the batch over and over again.

On Thu, May 17, 2018 at 12:08 AM David Lang wrote:
> On Wed, 16 May 2018, Rich Megginson wrote:
>
>> On 05/16/2018 05:58 PM, David Lang wrote:
>>> there's no need to add this extra complexity (multiple rulesets and queues)
>>>
>>> What should be happening (on any output module) is:
>>>
>>> submit a batch.
>>>   If rejected with a soft error, retry/suspend the output
>>
>> retry of the entire batch? see below
>>
>>> if batch-size=1 and a hard error, send to errorfile
>>>   if rejected with a hard error resubmit half of the batch
>>
>> But what if 90% of the batch was successfully added? Then you are needlessly resubmitting many of the records in the batch.
>
> when submitting batches, you get a success/fail for the batch as a whole (for 99% of things that actually allow you to insert in batches), so you don't know what message failed. This is a database transaction (again, in most cases), so if a batch fails, all you can do is bisect to figure out what message fails. If the endpoint is inserting some of the messages from a batch that fails, that's usually a bad thing.
>
> now, if ES batch mode isn't an ACID transaction and it accepts some messages and then tells you which ones failed, then you can mark the ones accepted as done and just retry the ones that fail. But there's still no need for a separate ruleset and queue. In Rsyslog, if an output cannot accept a message and there's reason to think that it will in the future, then you suspend that output and try again later. If you have reason to believe that the message is never going to be able to be delivered, then you need to fail the message or you will be stuck forever. This is what the error output was made for.
>
>> If using the "index" (default) bulk type, this causes duplicate records to be added.
>> If using the "create" type (and you have assigned a unique _id), you will get back many 409 Duplicate errors.
>> This causes problems - we know because this is how the fluentd plugin used to work, which is why we had to change it.
>>
>> https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
>> "Bulk Rejections"
>> "It is much better to handle queuing in your application by gracefully handling the back pressure from a full queue. When you receive bulk rejections, you should take these steps:
>>
>> Pause the import thread for 3–5 seconds.
>> Extract the rejected actions from the bulk response, since it is probable that many of the actions were successful. The bulk response will tell you which succeeded and which were rejected.
>> Send a new bulk request with just the rejected actions.
>> Repeat from step 1 if rejections are encountered again.
>>
>> Using this procedure, your code naturally adapts to the load of your cluster and naturally backs off."
>
> Does it really accept some and reject some in a random manner? Or is it a matter of accepting the first X and rejecting any after that point? The first is easier to deal with.
>
> Batch mode was created to be able to more efficiently process messages that are inserted into databases; we then found that the reduced queue congestion was a significant advantage in itself.
>
> But unless you have a queue just for the ES action, doing queue manipulation isn't possible; all you can do is succeed or fail, and if you fail, the retry logic will kick in.
>
> Rainer is going to need to comment on this.
>
> David Lang
>
>>> repeat
>>>
>>> all that should be needed is to add tests into omelasticsearch to detect the soft errors and turn them into retries (or suspend the output as appropriate)
>>>
>>> David Lang
Re: [rsyslog] omelasticsearch - failed operation handling
On Wed, 16 May 2018, Rich Megginson wrote:

> On 05/16/2018 05:58 PM, David Lang wrote:
>> there's no need to add this extra complexity (multiple rulesets and queues)
>>
>> What should be happening (on any output module) is:
>>
>> submit a batch.
>>   If rejected with a soft error, retry/suspend the output
>
> retry of the entire batch? see below
>
>> if batch-size=1 and a hard error, send to errorfile
>>   if rejected with a hard error resubmit half of the batch
>
> But what if 90% of the batch was successfully added? Then you are needlessly resubmitting many of the records in the batch.

when submitting batches, you get a success/fail for the batch as a whole (for 99% of things that actually allow you to insert in batches), so you don't know what message failed. This is a database transaction (again, in most cases), so if a batch fails, all you can do is bisect to figure out what message fails. If the endpoint is inserting some of the messages from a batch that fails, that's usually a bad thing.

now, if ES batch mode isn't an ACID transaction and it accepts some messages and then tells you which ones failed, then you can mark the ones accepted as done and just retry the ones that fail. But there's still no need for a separate ruleset and queue. In Rsyslog, if an output cannot accept a message and there's reason to think that it will in the future, then you suspend that output and try again later. If you have reason to believe that the message is never going to be able to be delivered, then you need to fail the message or you will be stuck forever. This is what the error output was made for.

> If using the "index" (default) bulk type, this causes duplicate records to be added.
> If using the "create" type (and you have assigned a unique _id), you will get back many 409 Duplicate errors.
> This causes problems - we know because this is how the fluentd plugin used to work, which is why we had to change it.
>
> https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
> "Bulk Rejections"
> "It is much better to handle queuing in your application by gracefully handling the back pressure from a full queue. When you receive bulk rejections, you should take these steps:
>
> Pause the import thread for 3–5 seconds.
> Extract the rejected actions from the bulk response, since it is probable that many of the actions were successful. The bulk response will tell you which succeeded and which were rejected.
> Send a new bulk request with just the rejected actions.
> Repeat from step 1 if rejections are encountered again.
>
> Using this procedure, your code naturally adapts to the load of your cluster and naturally backs off."

Does it really accept some and reject some in a random manner? Or is it a matter of accepting the first X and rejecting any after that point? The first is easier to deal with.

Batch mode was created to be able to more efficiently process messages that are inserted into databases; we then found that the reduced queue congestion was a significant advantage in itself.

But unless you have a queue just for the ES action, doing queue manipulation isn't possible; all you can do is succeed or fail, and if you fail, the retry logic will kick in.

Rainer is going to need to comment on this.

David Lang

>> repeat
>>
>> all that should be needed is to add tests into omelasticsearch to detect the soft errors and turn them into retries (or suspend the output as appropriate)
>>
>> David Lang
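The four steps quoted from the ES guide amount to a loop like the following. This is a sketch of the procedure only; `send_bulk` is a hypothetical callable returning the parsed bulk response, not omelasticsearch's actual implementation.

```python
import time

# Sketch of the guide's back-off procedure: pause, extract the rejected
# actions from the bulk response, resubmit only those, and repeat.
def submit_with_backoff(actions, send_bulk, pause=3.0, max_rounds=10):
    for _ in range(max_rounds):
        response = send_bulk(actions)
        if not response.get("errors"):
            return []  # everything was accepted
        # keep only the actions whose per-item status was a 429 rejection
        actions = [
            a for a, item in zip(actions, response["items"])
            if next(iter(item.values()))["status"] == 429
        ]
        if not actions:
            return []  # remaining failures were hard errors, not rejections
        time.sleep(pause)  # step 1: pause before resubmitting
    return actions  # still rejected after max_rounds
```

As the guide says, this naturally adapts to cluster load: the more the cluster rejects, the longer the sender waits and the smaller the resubmitted batches become.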
Re: [rsyslog] omelasticsearch - failed operation handling
On 05/16/2018 05:58 PM, David Lang wrote:
> there's no need to add this extra complexity (multiple rulesets and queues)
>
> What should be happening (on any output module) is:
>
> - submit a batch
> - if rejected with a soft error, retry/suspend the output

retry of the entire batch? see below

> - if batch-size=1 and a hard error, send to errorfile
> - if rejected with a hard error, resubmit half of the batch

But what if 90% of the batch was successfully added? Then you are needlessly resubmitting many of the records in the batch. If using the "index" (default) bulk type, this causes duplicate records to be added. If using the "create" type (and you have assigned a unique _id), you will get back many 409 Duplicate errors. This causes problems - we know because this is how the fluentd plugin used to work, which is why we had to change it.

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
"Bulk Rejections"

"It is much better to handle queuing in your application by gracefully handling the back pressure from a full queue. When you receive bulk rejections, you should take these steps:

1. Pause the import thread for 3–5 seconds.
2. Extract the rejected actions from the bulk response, since it is probable that many of the actions were successful. The bulk response will tell you which succeeded and which were rejected.
3. Send a new bulk request with just the rejected actions.
4. Repeat from step 1 if rejections are encountered again.

Using this procedure, your code naturally adapts to the load of your cluster and naturally backs off."

> - repeat
>
> all that should be needed is to add tests into omelasticsearch to detect the soft errors and turn them into retries (or suspend the output as appropriate)
>
> David Lang
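[Editor's note: the index-vs-create distinction above can be illustrated with a toy in-memory model. This is purely illustrative and is not how Elasticsearch is implemented: with the "index" op type and auto-generated _ids, resubmitting a batch stores the same records again; with "create" and a caller-assigned unique _id, the resubmitted records come back as 409 duplicates instead.]

```python
import itertools

_auto_ids = itertools.count()

def bulk(store, op_type, docs):
    """docs: list of (doc_id, body); doc_id=None means auto-generated.
    Returns a per-item HTTP-like status list, mirroring the bulk API."""
    statuses = []
    for doc_id, body in docs:
        if doc_id is None:                          # auto _id: always a new doc
            store["auto-%d" % next(_auto_ids)] = body
            statuses.append(201)
        elif op_type == "create" and doc_id in store:
            statuses.append(409)                    # duplicate: create refuses
        else:
            store[doc_id] = body                    # index: insert or overwrite
            statuses.append(201)
    return statuses
```

Resubmitting a whole batch with "index" and auto ids silently doubles the stored records; with "create" the duplicates are at least visible as 409s, but the client still has to sort them out.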
Re: [rsyslog] omelasticsearch - failed operation handling
there's no need to add this extra complexity (multiple rulesets and queues)

What should be happening (on any output module) is:

- submit a batch
- if rejected with a soft error, retry/suspend the output
- if batch-size=1 and a hard error, send to errorfile
- if rejected with a hard error, resubmit half of the batch
- repeat

all that should be needed is to add tests into omelasticsearch to detect the soft errors and turn them into retries (or suspend the output as appropriate)

David Lang
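[Editor's note: the split-on-hard-error strategy described above can be sketched as below, under the assumption that `submit` is a hypothetical callable reporting "ok", "soft", or "hard" for a whole batch; real rsyslog code would suspend the action on a soft error rather than loop in place.]

```python
def process_batch(submit, batch, errorfile, max_retries=3):
    """Bisecting retry: isolate hard-failing records down to batch size 1."""
    for _ in range(max_retries):
        result = submit(batch)
        if result == "ok":
            return
        if result == "hard":
            if len(batch) == 1:
                errorfile.append(batch[0])   # lone hard failure: error file
                return
            mid = len(batch) // 2            # split and retry each half
            process_batch(submit, batch[:mid], errorfile, max_retries)
            process_batch(submit, batch[mid:], errorfile, max_retries)
            return
        # result == "soft": retry the whole batch (rsyslog would suspend
        # the action and let its own retry logic provide the pause)
```

The bisection converges on the poison message in O(log n) submits, at the cost of resubmitting already-accepted records along the way, which is the objection raised elsewhere in this thread.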
[rsyslog] omelasticsearch - failed operation handling
In many cases, when adding a record to Elasticsearch, an HTTP status other than 200 or 201 does not necessarily mean that the record cannot be added. One case is bulk index rejection: here the HTTP status for the record in the response is 429, and it may be that only a short pause is required before resubmitting the record. omelasticsearch has support for an errorfile, but this requires the operator to examine the file and resubmit the records manually.

The fluentd elasticsearch plugin recently got support for better error handling: https://github.com/richm/docs/blob/master/fluent-plugin-elasticsearch-retry.md

I would like to do the same with omelasticsearch. It would work something like this:

- a record that fails with a "soft" error would be sent to a "retry queue"
- a record that fails with a "hard" error would be sent to an "error queue"

In fluentd this is best accomplished with judicious use of tagging and labeling. I think for rsyslog this would best be handled by rulesets - a retry_es ruleset and an error ruleset. We have the original request in JSON string form, and the response JSON contains one response object for each request object, in the exact same order.

pseudo code:

    if response is an error
        for each item in response
            get the corresponding request string
            convert request JSON string to a JSON object
            MsgNew - set Msg fields from the JSON object
            if it is a soft error
                MsgSetRuleset retry_es ruleset
            else
                MsgSetRuleset error ruleset
            ratelimitAddMsg

Note that status 200 and 201 are success, and status 409 when using the "create" operation (I'm also working on adding support for this) is a duplicate and is considered successful.

Without using a ruleset, I assume a message would be submitted at the "top" of the processing pipeline, which in most cases would require a lot of work to make that pipeline idempotent.

The config would look something like this:

    ruleset(name="error_es") {
        action(type="omfile" ... write hard failures to error file ...)
    }
    ruleset(name="try_es") {
        action(type="omelasticsearch"
               retryRuleset="try_es"
               errorRuleset="error_es"
               ...)
    }
    ... normal pipeline ...
    call try_es # for both the normal case and the retry case
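[Editor's note: the per-record classification the pseudo code implies can be sketched as below, following the status handling the proposal states (200/201 success, 409 on "create" a successful duplicate, 429 soft). The queue appends are stand-ins for MsgSetRuleset plus ratelimitAddMsg; this is a sketch, not the omelasticsearch implementation.]

```python
def classify(op_type, status):
    """Map one bulk-response item to success / soft (retry) / hard (error)."""
    if status in (200, 201):
        return "success"
    if status == 409 and op_type == "create":
        return "success"          # duplicate: the record is already indexed
    if status == 429:
        return "soft"             # bulk rejection: retry after a pause
    return "hard"                 # e.g. 400 mapping error: will never succeed

def route(requests, response_items, retry_queue, error_queue):
    """requests[i] corresponds to response_items[i]; the bulk API
    guarantees the response preserves request order."""
    for request, item in zip(requests, response_items):
        op_type, result = next(iter(item.items()))
        kind = classify(op_type, result["status"])
        if kind == "soft":
            retry_queue.append(request)   # would be: MsgSetRuleset retry_es
        elif kind == "hard":
            error_queue.append(request)   # would be: MsgSetRuleset error_es
```

Because try_es calls itself via retryRuleset, soft failures re-enter the same action, while hard failures leave the loop through error_es.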