Re: [rsyslog] omelasticsearch - failed operation handling

2018-05-23 Thread Rich Megginson via rsyslog
maybe the actual code will explain what I intend: 
https://github.com/rsyslog/rsyslog/pull/2733


On 05/18/2018 10:52 AM, Rainer Gerhards wrote:
Just quickly chiming in; I will need to catch a plane early tomorrow
morning.


It's complicated. At this point, the original message is no longer
available, as omelasticsearch works with batches, but the rule engine
needs to process message by message (we had to change that some time
ago). The messages are still in batches, but modifications happen to
the messages, so they need to go through individually. This needs more
explanation, for which I currently have no time.


So we need to either create a new rsyslog core-to-plugin interface or
do something omelasticsearch-specific.


I can elaborate more at the end of May.

Rainer

Sent from phone, thus brief.

David Lang wrote on Thu, 17 May 2018, 18:25:


On Thu, 17 May 2018, Rich Megginson wrote:

>> then you can mark the ones accepted as done and just retry the ones that
>> fail.
>
> That's what I'm proposing.
>
>> But there's still no need for a separate ruleset and queue. In Rsyslog, if
>> an output cannot accept a message and there's reason to think that it will
>> in the future, then you suspend that output and try again later. If you
>> have reason to believe that the message is never going to be able to be
>> delivered, then you need to fail the message or you will be stuck forever.
>> This is what the error output was made for.
>
> So how would that work on a per-record basis?
>
> Would this be something different than using MsgConstruct -> set fields in
> msg from original request -> ratelimitAddMsg for each record to resubmit?

Rainer, in a batch, is there any way to mark some of the messages as delivered
and others as failed as opposed to failing the entire batch?

>>
>>> If using the "index" (default) bulk type, this causes duplicate records to
>>> be added.
>>> If using the "create" type (and you have assigned a unique _id), you will
>>> get back many 409 Duplicate errors.
>>> This causes problems - we know because this is how the fluentd plugin used
>>> to work, which is why we had to change it.
>>>
>>> https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
>>> "Bulk Rejections"
>>> "It is much better to handle queuing in your application by gracefully
>>> handling the back pressure from a full queue. When you receive bulk
>>> rejections, you should take these steps:
>>>
>>>     Pause the import thread for 3–5 seconds.
>>>     Extract the rejected actions from the bulk response, since it is
>>> probable that many of the actions were successful. The bulk response will
>>> tell you which succeeded and which were rejected.
>>>     Send a new bulk request with just the rejected actions.
>>>     Repeat from step 1 if rejections are encountered again.
>>>
>>> Using this procedure, your code naturally adapts to the load of your
>>> cluster and naturally backs off.
>>> "
>>
>> Does it really accept some and reject some in a random manner? or is it a
>> matter of accepting the first X and rejecting any after that point? The
>> first is easier to deal with.
>
> It appears to be random.  So you may get a failure from the first record in
> the batch and the last record in the batch, and success for the others.  Or
> vice versa.  There appear to be many, many factors in the tuning, hardware,
> network, etc. that come into play.
>
> There isn't an easy way to deal with this :P
>
>>
>>
>> Batch mode was created to be able to more efficiently process
messages that
>> are inserted into databases, we then found that the reduced queue
>> congestion was a significant advantage in itself.
>>
>> But unless you have a queue just for the ES action,
>
> That's what we had to do for the fluentd case - we have a
separate "ES retry
> queue".  One of the tricky parts is that there may be multiple
outputs - you
> may want to send each log record to Elasticsearch _and_ a
message bus _and_ a
> remote rsyslog forwarder. But you only want to retry sending to
Elasticsearch
> to avoid duplication in the other outputs.

In Rsyslog, queues are explicitly configured by the admin (for
various reasons,
including performance and reliability trade-offs), I really don't
like the idea
of omelasticsearch creating it's own queue without these options.
Kafka does
this and it's an ongoing source of problems.



___
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog

Re: [rsyslog] omelasticsearch - failed operation handling

2018-05-18 Thread Rainer Gerhards
Just quickly chiming in; I will need to catch a plane early tomorrow morning.

It's complicated. At this point, the original message is no longer
available, as omelasticsearch works with batches, but the rule engine needs
to process message by message (we had to change that some time ago). The
messages are still in batches, but modifications happen to the messages, so
they need to go through individually. This needs more explanation, for which
I currently have no time.

So we need to either create a new rsyslog core-to-plugin interface or do
something omelasticsearch-specific.

I can elaborate more at the end of May.

Rainer

Sent from phone, thus brief.

David Lang wrote on Thu, 17 May 2018, 18:25:

> On Thu, 17 May 2018, Rich Megginson wrote:
>
> >> then you can mark the ones accepted as done and just retry the ones that
> >> fail.
> >
> > That's what I'm proposing.
> >
> >> But there's still no need for a separate ruleset and queue. In Rsyslog,
> >> if an output cannot accept a message and there's reason to think that it
> >> will in the future, then you suspend that output and try again later. If
> >> you have reason to believe that the message is never going to be able to
> >> be delivered, then you need to fail the message or you will be stuck
> >> forever. This is what the error output was made for.
> >
> > So how would that work on a per-record basis?
> >
> > Would this be something different than using MsgConstruct -> set fields
> > in msg from original request -> ratelimitAddMsg for each record to
> > resubmit?
>
> Rainer, in a batch, is there any way to mark some of the messages as
> delivered and others as failed as opposed to failing the entire batch?
>
> >>
> >>> If using the "index" (default) bulk type, this causes duplicate records
> >>> to be added.
> >>> If using the "create" type (and you have assigned a unique _id), you
> >>> will get back many 409 Duplicate errors.
> >>> This causes problems - we know because this is how the fluentd plugin
> >>> used to work, which is why we had to change it.
> >>>
> >>> https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
> >>> "Bulk Rejections"
> >>> "It is much better to handle queuing in your application by gracefully
> >>> handling the back pressure from a full queue. When you receive bulk
> >>> rejections, you should take these steps:
> >>>
> >>> Pause the import thread for 3–5 seconds.
> >>> Extract the rejected actions from the bulk response, since it is
> >>> probable that many of the actions were successful. The bulk response
> >>> will tell you which succeeded and which were rejected.
> >>> Send a new bulk request with just the rejected actions.
> >>> Repeat from step 1 if rejections are encountered again.
> >>>
> >>> Using this procedure, your code naturally adapts to the load of your
> >>> cluster and naturally backs off.
> >>> "
> >>
> >> Does it really accept some and reject some in a random manner? or is it
> >> a matter of accepting the first X and rejecting any after that point?
> >> The first is easier to deal with.
> >
> > It appears to be random.  So you may get a failure from the first record
> > in the batch and the last record in the batch, and success for the
> > others.  Or vice versa.  There appear to be many, many factors in the
> > tuning, hardware, network, etc. that come into play.
> >
> > There isn't an easy way to deal with this :P
> >
> >>
> >>
> >> Batch mode was created to be able to more efficiently process messages
> >> that are inserted into databases, we then found that the reduced queue
> >> congestion was a significant advantage in itself.
> >>
> >> But unless you have a queue just for the ES action,
> >
> > That's what we had to do for the fluentd case - we have a separate "ES
> > retry queue".  One of the tricky parts is that there may be multiple
> > outputs - you may want to send each log record to Elasticsearch _and_ a
> > message bus _and_ a remote rsyslog forwarder. But you only want to retry
> > sending to Elasticsearch to avoid duplication in the other outputs.
>
> In Rsyslog, queues are explicitly configured by the admin (for various
> reasons, including performance and reliability trade-offs), so I really
> don't like the idea of omelasticsearch creating its own queue without these
> options. Kafka does this and it's an ongoing source of problems.

Re: [rsyslog] omelasticsearch - failed operation handling

2018-05-17 Thread David Lang

On Thu, 17 May 2018, Rich Megginson wrote:

then you can mark the ones accepted as done and just retry the ones that 
fail.


That's what I'm proposing.

But there's still no need for a separate ruleset and queue. In Rsyslog, if 
an output cannot accept a message and there's reason to think that it will 
in the future, then you suspend that output and try again later. If you 
have reason to believe that the message is never going to be able to be 
delivered, then you need to fail the message or you will be stuck forever. 
This is what the error output was made for.


So how would that work on a per-record basis?

Would this be something different than using MsgConstruct -> set fields in 
msg from original request -> ratelimitAddMsg for each record to resubmit?


Rainer, in a batch, is there any way to mark some of the messages as delivered 
and others as failed as opposed to failing the entire batch?




If using the "index" (default) bulk type, this causes duplicate records to 
be added.
If using the "create" type (and you have assigned a unique _id), you will 
get back many 409 Duplicate errors.
This causes problems - we know because this is how the fluentd plugin used 
to work, which is why we had to change it.


https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section 
"Bulk Rejections"
"It is much better to handle queuing in your application by gracefully 
handling the back pressure from a full queue. When you receive bulk 
rejections, you should take these steps:


    Pause the import thread for 3–5 seconds.
    Extract the rejected actions from the bulk response, since it is 
probable that many of the actions were successful. The bulk response will 
tell you which succeeded and which were rejected.

    Send a new bulk request with just the rejected actions.
    Repeat from step 1 if rejections are encountered again.

Using this procedure, your code naturally adapts to the load of your 
cluster and naturally backs off.

"


Does it really accept some and reject some in a random manner? or is it a 
matter of accepting the first X and rejecting any after that point? The 
first is easier to deal with.


It appears to be random.  So you may get a failure from the first record in 
the batch and the last record in the batch, and success for the others.  Or 
vice versa.  There appear to be many, many factors in the tuning, hardware, 
network, etc. that come into play.


There isn't an easy way to deal with this :P




Batch mode was created to be able to more efficiently process messages that 
are inserted into databases, we then found that the reduced queue 
congestion was a significant advantage in itself.


But unless you have a queue just for the ES action,


That's what we had to do for the fluentd case - we have a separate "ES retry 
queue".  One of the tricky parts is that there may be multiple outputs - you 
may want to send each log record to Elasticsearch _and_ a message bus _and_ a 
remote rsyslog forwarder. But you only want to retry sending to Elasticsearch 
to avoid duplication in the other outputs.


In Rsyslog, queues are explicitly configured by the admin (for various reasons, 
including performance and reliability trade-offs), so I really don't like the 
idea of omelasticsearch creating its own queue without these options. Kafka does 
this and it's an ongoing source of problems.


Re: [rsyslog] omelasticsearch - failed operation handling

2018-05-17 Thread Rich Megginson via rsyslog

On 05/17/2018 05:52 AM, Brian Knox wrote:
To my knowledge, Rich is correct. This also would explain a case we 
hit maybe every couple of months, where rsyslog very quickly 
duplicates some messages it is sending to elasticsearch. I would 
assume this would be a case where a batch is submitted, only some of 
the messages are rejected, and rsyslog then duplicates messages trying 
to send the batch over and over again.


You can confirm this by monitoring the bulk index thread pool 
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-thread-pool.html 
to see if you are getting bulk rejections.
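[Editor's note: a minimal sketch of that check. The cat thread-pool API returns a whitespace-separated table; the rejection counter per node is the column to watch. The sample output below is made up, and the exact column names should be verified against your ES version (column selection via the `h=` parameter is part of the cat APIs):]

```python
# Sketch: detect bulk rejections from _cat/thread_pool output.
# SAMPLE stands in for the body returned by something like
#   GET /_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected
# (the values and node names here are invented for illustration).
SAMPLE = """\
host   bulk.active bulk.queue bulk.rejected
node-1           4         12            37
node-2           2          0             0
"""

def rejected_per_host(cat_output):
    """Parse the cat table and return {host: rejected_count}."""
    lines = cat_output.strip().splitlines()
    header = lines[0].split()
    idx_host = header.index("host")
    idx_rej = header.index("bulk.rejected")
    return {row.split()[idx_host]: int(row.split()[idx_rej])
            for row in lines[1:]}

# A nonzero, growing rejected count means ES is shedding bulk load.
print(rejected_per_host(SAMPLE))
```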




On Thu, May 17, 2018 at 12:08 AM David Lang wrote:


On Wed, 16 May 2018, Rich Megginson wrote:

> On 05/16/2018 05:58 PM, David Lang wrote:
>> there's no need to add this extra complexity (multiple rulesets and queues)
>>
>> What should be happening (on any output module) is:
>>
>> submit a batch.
>>    If rejected with a soft error, retry/suspend the output
>
> retry of the entire batch?  see below
>
>> if batch-size=1 and a hard error, send to errorfile
>>    if rejected with a hard error resubmit half of the batch
>
> But what if 90% of the batch was successfully added? Then you are needlessly
> resubmitting many of the records in the batch.

when submitting batches, you get a success/fail for the batch as a whole (for
99% of things that actually allow you to insert in batches), so you don't know
what message failed. This is a database transaction (again, in most cases), so
if a batch fails, all you can do is bisect to figure out what message fails. If
the endpoint is inserting some of the messages from a batch that fails, that's
usually a bad thing.

now, if ES batch mode isn't an ACID transaction and it accepts some messages
and then tells you which ones failed, then you can mark the ones accepted as
done and just retry the ones that fail. But there's still no need for a
separate ruleset and queue. In Rsyslog, if an output cannot accept a message
and there's reason to think that it will in the future, then you suspend that
output and try again later. If you have reason to believe that the message is
never going to be able to be delivered, then you need to fail the message or
you will be stuck forever. This is what the error output was made for.

> If using the "index" (default) bulk type, this causes duplicate records to
> be added.
> If using the "create" type (and you have assigned a unique _id), you will
> get back many 409 Duplicate errors.
> This causes problems - we know because this is how the fluentd plugin used
> to work, which is why we had to change it.
>
> https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
> "Bulk Rejections"
> "It is much better to handle queuing in your application by gracefully
> handling the back pressure from a full queue. When you receive bulk
> rejections, you should take these steps:
>
>     Pause the import thread for 3–5 seconds.
>     Extract the rejected actions from the bulk response, since it is
> probable that many of the actions were successful. The bulk response will
> tell you which succeeded and which were rejected.
>     Send a new bulk request with just the rejected actions.
>     Repeat from step 1 if rejections are encountered again.
>
> Using this procedure, your code naturally adapts to the load of your
> cluster and naturally backs off.
> "

Does it really accept some and reject some in a random manner? or is it a
matter of accepting the first X and rejecting any after that point? The first
is easier to deal with.

Batch mode was created to be able to more efficiently process messages that
are inserted into databases, we then found that the reduced queue congestion
was a significant advantage in itself.

But unless you have a queue just for the ES action, doing queue manipulation
isn't possible, all you can do is succeed or fail, and if you fail, the retry
logic will kick in.

Rainer is going to need to comment on this.

David Lang

>
>> repeat
>>
>> all that should be needed is to add tests into omelasticsearch to detect
>> the soft errors and turn them into retries (or suspend the output as
>> appropriate)
>>
>> David Lang
>
>
>

Re: [rsyslog] omelasticsearch - failed operation handling

2018-05-17 Thread Rich Megginson via rsyslog

On 05/16/2018 10:08 PM, David Lang wrote:

On Wed, 16 May 2018, Rich Megginson wrote:


On 05/16/2018 05:58 PM, David Lang wrote:
there's no need to add this extra complexity (multiple rulesets and 
queues)


What should be happening (on any output module) is:

submit a batch.
   If rejected with a soft error, retry/suspend the output


retry of the entire batch?  see below


if batch-size=1 and a hard error, send to errorfile
   if rejected with a hard error resubmit half of the batch


But what if 90% of the batch was successfully added?  Then you are 
needlessly resubmitting many of the records in the batch.


when submitting batches, you get a success/fail for the batch as a 
whole (for 99% of things that actually allow you to insert in batches),


For Elasticsearch - yes, there is a top level "errors" field in the 
response with a binary value true or false.  false means all records in 
the batch were successfully processed. true means _at least one_ record 
in the batch was not processed successfully.  For example, in a large 
batch you will get a response of "errors": true even if all but one of 
the records were successfully processed.



so you don't know what message failed.


You do know exactly which record failed and in most cases what the error 
was.  Here is an example from the fluent-plugin-elasticsearch unit test: 
https://github.com/uken/fluent-plugin-elasticsearch/blob/master/test/plugin/test_elasticsearch_error_handler.rb#L88
This is what the response looks like coming from Elasticsearch.  You get 
a separate response item for every record submitted in the bulk 
request.  In addition, you are guaranteed that the order of the items in 
the response is exactly the same as the order of the items submitted in 
the bulk request, so that you can exactly correlate the request object 
with the response.
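[Editor's note: given that ordering guarantee, pairing submitted actions with response items is a simple zip. A sketch, assuming the response shape described above (`errors` flag, per-item `status`/`error`); `actions` is whatever per-record structure the caller keeps:]

```python
def split_bulk_results(actions, response):
    """Pair each submitted bulk action with its response item and split
    them into successes and failures. Relies on the guarantee that the
    items in the response come back in submission order."""
    ok, failed = [], []
    if not response.get("errors"):
        return list(actions), failed  # fast path: everything succeeded
    for action, item in zip(actions, response["items"]):
        # each response item is keyed by its op type ("index", "create", ...)
        result = next(iter(item.values()))
        if 200 <= result.get("status", 0) < 300:
            ok.append(action)
        else:
            failed.append((action, result.get("error")))
    return ok, failed
```

The `failed` list is exactly the set of actions a retry path would resubmit (or route to an error file on a hard error such as a 400 mapping failure).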



This is a database transaction (again, in most cases),


Not in Elasticsearch at the bulk index level.  Probably at the very low 
level where lucene hits the disk.


so if a batch fails, all you can do is bisect to figure out what 
message fails. If the endpoint is inserting some of the messages from 
a batch that fails, that's usually a bad thing.


now, if ES batch mode isn't an ACID transaction and it accepts some 
messages and then tells you which ones failed,


It does

then you can mark the ones accepted as done and just retry the ones 
that fail.


That's what I'm proposing.

But there's still no need for a separate ruleset and queue. In 
Rsyslog, if an output cannot accept a message and there's reason to 
think that it will in the future, then you suspend that output and try 
again later. If you have reason to believe that the message is never 
going to be able to be delivered, then you need to fail the message or 
you will be stuck forever. This is what the error output was made for.


So how would that work on a per-record basis?

Would this be something different than using MsgConstruct -> set fields 
in msg from original request -> ratelimitAddMsg for each record to resubmit?




If using the "index" (default) bulk type, this causes duplicate 
records to be added.
If using the "create" type (and you have assigned a unique _id), you 
will get back many 409 Duplicate errors.
This causes problems - we know because this is how the fluentd plugin 
used to work, which is why we had to change it.


https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section 


"Bulk Rejections"
"It is much better to handle queuing in your application by 
gracefully handling the back pressure from a full queue. When you 
receive bulk rejections, you should take these steps:


    Pause the import thread for 3–5 seconds.
    Extract the rejected actions from the bulk response, since it is 
probable that many of the actions were successful. The bulk response 
will tell you which succeeded and which were rejected.

    Send a new bulk request with just the rejected actions.
    Repeat from step 1 if rejections are encountered again.

Using this procedure, your code naturally adapts to the load of your 
cluster and naturally backs off.

"


Does it really accept some and reject some in a random manner? or is 
it a matter of accepting the first X and rejecting any after that 
point? The first is easier to deal with.


It appears to be random.  So you may get a failure from the first record 
in the batch and the last record in the batch, and success for the 
others.  Or vice versa.  There appear to be many, many factors in the 
tuning, hardware, network, etc. that come into play.


There isn't an easy way to deal with this :P




Batch mode was created to be able to more efficiently process messages 
that are inserted into databases, we then found that the reduced queue 
congestion was a significant advantage in itself.


But unless you have a queue just for the ES action,


That's what we had to do for the fluentd case - we have a separate "ES retry 
queue".  One of the tricky parts is that there may be multiple outputs - you 
may want to send each log record to Elasticsearch _and_ a message bus _and_ a 
remote rsyslog forwarder. But you only want to retry sending to Elasticsearch 
to avoid duplication in the other outputs.

Re: [rsyslog] omelasticsearch - failed operation handling

2018-05-17 Thread Brian Knox via rsyslog
To my knowledge, Rich is correct. This also would explain a case we hit
maybe every couple of months, where rsyslog very quickly duplicates some
messages it is sending to elasticsearch. I would assume this would be a
case where a batch is submitted, only some of the messages are rejected,
and rsyslog then duplicates messages trying to send the batch over and over
again.

On Thu, May 17, 2018 at 12:08 AM David Lang  wrote:

> On Wed, 16 May 2018, Rich Megginson wrote:
>
> > On 05/16/2018 05:58 PM, David Lang wrote:
> >> there's no need to add this extra complexity (multiple rulesets and
> >> queues)
> >>
> >> What should be happening (on any output module) is:
> >>
> >> submit a batch.
> >>    If rejected with a soft error, retry/suspend the output
> >
> > retry of the entire batch?  see below
> >
> >> if batch-size=1 and a hard error, send to errorfile
> >>    if rejected with a hard error resubmit half of the batch
> >
> > But what if 90% of the batch was successfully added?  Then you are
> > needlessly resubmitting many of the records in the batch.
>
> when submitting batches, you get a success/fail for the batch as a whole
> (for 99% of things that actually allow you to insert in batches), so you
> don't know what message failed. This is a database transaction (again, in
> most cases), so if a batch fails, all you can do is bisect to figure out
> what message fails. If the endpoint is inserting some of the messages from
> a batch that fails, that's usually a bad thing.
>
> now, if ES batch mode isn't an ACID transaction and it accepts some
> messages and then tells you which ones failed, then you can mark the ones
> accepted as done and just retry the ones that fail. But there's still no
> need for a separate ruleset and queue. In Rsyslog, if an output cannot
> accept a message and there's reason to think that it will in the future,
> then you suspend that output and try again later. If you have reason to
> believe that the message is never going to be able to be delivered, then
> you need to fail the message or you will be stuck forever. This is what the
> error output was made for.
>
> > If using the "index" (default) bulk type, this causes duplicate records
> > to be added.
> > If using the "create" type (and you have assigned a unique _id), you will
> > get back many 409 Duplicate errors.
> > This causes problems - we know because this is how the fluentd plugin
> > used to work, which is why we had to change it.
> >
> > https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
> > "Bulk Rejections"
> > "It is much better to handle queuing in your application by gracefully
> > handling the back pressure from a full queue. When you receive bulk
> > rejections, you should take these steps:
> >
> > Pause the import thread for 3–5 seconds.
> > Extract the rejected actions from the bulk response, since it is probable
> > that many of the actions were successful. The bulk response will tell you
> > which succeeded and which were rejected.
> > Send a new bulk request with just the rejected actions.
> > Repeat from step 1 if rejections are encountered again.
> >
> > Using this procedure, your code naturally adapts to the load of your
> > cluster and naturally backs off.
> > "
>
> Does it really accept some and reject some in a random manner? or is it a
> matter of accepting the first X and rejecting any after that point? The
> first is easier to deal with.
>
> Batch mode was created to be able to more efficiently process messages that
> are inserted into databases, we then found that the reduced queue
> congestion was a significant advantage in itself.
>
> But unless you have a queue just for the ES action, doing queue
> manipulation isn't possible, all you can do is succeed or fail, and if you
> fail, the retry logic will kick in.
>
> Rainer is going to need to comment on this.
>
> David Lang
>
> >
> >> repeat
> >>
> >> all that should be needed is to add tests into omelasticsearch to detect
> >> the soft errors and turn them into retries (or suspend the output as
> >> appropriate)
> >>
> >> David Lang
> >
> >
> >

Re: [rsyslog] omelasticsearch - failed operation handling

2018-05-16 Thread David Lang

On Wed, 16 May 2018, Rich Megginson wrote:


On 05/16/2018 05:58 PM, David Lang wrote:

there's no need to add this extra complexity (multiple rulesets and queues)

What should be happening (on any output module) is:

submit a batch.
   If rejected with a soft error, retry/suspend the output


retry of the entire batch?  see below


if batch-size=1 and a hard error, send to errorfile
   if rejected with a hard error resubmit half of the batch


But what if 90% of the batch was successfully added?  Then you are needlessly 
resubmitting many of the records in the batch.


when submitting batches, you get a success/fail for the batch as a whole (for 
99% of things that actually allow you to insert in batches), so you don't know 
what message failed. This is a database transaction (again, in most cases), so 
if a batch fails, all you can do is bisect to figure out what message fails. If 
the endpoint is inserting some of the messages from a batch that fails, that's 
usually a bad thing.
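[Editor's note: the bisect approach described above can be sketched as follows. This is a toy version; `try_insert` and `on_error` are placeholders for the real output action and the error-file path:]

```python
def deliver_batch(batch, try_insert, on_error):
    """Bisect a failing batch down to the individual bad message(s).

    try_insert(batch) -> True if the whole sub-batch was accepted,
    False on a hard failure of the sub-batch.
    on_error(msg) receives any message that fails even on its own.
    """
    if not batch:
        return
    if try_insert(batch):
        return                      # whole sub-batch accepted
    if len(batch) == 1:
        on_error(batch[0])          # a lone failing message is bad data
        return
    mid = len(batch) // 2           # split and retry each half
    deliver_batch(batch[:mid], try_insert, on_error)
    deliver_batch(batch[mid:], try_insert, on_error)
```

In the worst case this costs O(log n) extra round trips per bad message, which is why an endpoint that reports per-record results (as ES bulk does) is preferable to bisection.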


now, if ES batch mode isn't an ACID transaction and it accepts some messages and 
then tells you which ones failed, then you can mark the ones accepted as done 
and just retry the ones that fail. But there's still no need for a separate 
ruleset and queue. In Rsyslog, if an output cannot accept a message and there's 
reason to think that it will in the future, then you suspend that output and try 
again later. If you have reason to believe that the message is never going to be 
able to be delivered, then you need to fail the message or you will be stuck 
forever. This is what the error output was made for.


If using the "index" (default) bulk type, this causes duplicate records to be 
added.
If using the "create" type (and you have assigned a unique _id), you will get 
back many 409 Duplicate errors.
This causes problems - we know because this is how the fluentd plugin used to 
work, which is why we had to change it.


https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
"Bulk Rejections"
"It is much better to handle queuing in your application by gracefully 
handling the back pressure from a full queue. When you receive bulk 
rejections, you should take these steps:


    Pause the import thread for 3–5 seconds.
    Extract the rejected actions from the bulk response, since it is probable 
that many of the actions were successful. The bulk response will tell you 
which succeeded and which were rejected.

    Send a new bulk request with just the rejected actions.
    Repeat from step 1 if rejections are encountered again.

Using this procedure, your code naturally adapts to the load of your cluster 
and naturally backs off.

"


Does it really accept some and reject some in a random manner? or is it a matter 
of accepting the first X and rejecting any after that point? The first is easier 
to deal with.


Batch mode was created to be able to more efficiently process messages that are 
inserted into databases, we then found that the reduced queue congestion was a 
significant advantage in itself.


But unless you have a queue just for the ES action, doing queue manipulation 
isn't possible, all you can do is succeed or fail, and if you fail, the retry 
logic will kick in.


Rainer is going to need to comment on this.

David Lang




repeat

all that should be needed is to add tests into omelasticsearch to detect 
the soft errors and turn them into retries (or suspend the output as 
appropriate)


David Lang






Re: [rsyslog] omelasticsearch - failed operation handling

2018-05-16 Thread Rich Megginson via rsyslog

On 05/16/2018 05:58 PM, David Lang wrote:
there's no need to add this extra complexity (multiple rulesets and 
queues)


What should be happening (on any output module) is:

submit a batch.
   If rejected with a soft error, retry/suspend the output


retry of the entire batch?  see below


if rejected with a hard error, resubmit half of the batch
   if batch-size=1 and a hard error, send to errorfile


But what if 90% of the batch was successfully added?  Then you are 
needlessly resubmitting many of the records in the batch.
If using the "index" (default) bulk type, this causes duplicate records 
to be added.
If using the "create" type (and you have assigned a unique _id), you 
will get back many 409 Duplicate errors.
This causes problems - we know because this is how the fluentd plugin 
used to work, which is why we had to change it.
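For illustration, here is one way (in Python) to build a bulk payload using the "create" op type with a deterministic `_id`, so that resubmitting an already-indexed record yields a 409 rather than a silent duplicate. Deriving the `_id` by hashing the record is an assumption made for this sketch, not how omelasticsearch assigns ids.

```python
import hashlib
import json

def bulk_create_payload(records, index):
    # Use the "create" op type with a content-derived _id: resubmitting
    # a record that was already indexed then returns 409 instead of
    # creating a duplicate document.
    lines = []
    for rec in records:
        body = json.dumps(rec, sort_keys=True)       # canonical form
        doc_id = hashlib.sha1(body.encode()).hexdigest()
        lines.append(json.dumps({"create": {"_index": index, "_id": doc_id}}))
        lines.append(body)
    return "\n".join(lines) + "\n"   # NDJSON: trailing newline required
```

Because the `_id` is derived from the record content, the payload is idempotent: sending it twice produces 409s on the second pass, which the caller can treat as success.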


https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
"Bulk Rejections"
"It is much better to handle queuing in your application by gracefully 
handling the back pressure from a full queue. When you receive bulk 
rejections, you should take these steps:

1. Pause the import thread for 3–5 seconds.
2. Extract the rejected actions from the bulk response, since it is 
   probable that many of the actions were successful. The bulk response 
   will tell you which succeeded and which were rejected.
3. Send a new bulk request with just the rejected actions.
4. Repeat from step 1 if rejections are encountered again.

Using this procedure, your code naturally adapts to the load of your 
cluster and naturally backs off."



repeat

all that should be needed is to add tests into omelasticsearch to 
detect the soft errors and turn them into retries (or suspend the 
output as appropriate)


David Lang




Re: [rsyslog] omelasticsearch - failed operation handling

2018-05-16 Thread David Lang

there's no need to add this extra complexity (multiple rulesets and queues)

What should be happening (on any output module) is:

submit a batch
   if rejected with a soft error, retry/suspend the output
   if rejected with a hard error, resubmit half of the batch at a time
   if batch-size=1 and it is a hard error, send the message to the errorfile
repeat

all that should be needed is to add tests into omelasticsearch to detect the 
soft errors and turn them into retries (or suspend the output as appropriate)
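A minimal sketch of such a soft/hard test, in Python. The exact status sets (429 and 5xx treated as soft) are assumptions for illustration, not what omelasticsearch currently implements:

```python
def is_soft_error(status):
    # 429 = bulk queue rejection; 5xx = transient server-side trouble.
    # (These sets are an assumption for the sketch, not what
    # omelasticsearch actually checks.)
    return status == 429 or 500 <= status <= 599

def classify(status):
    # Map a per-record HTTP status to the action the output should take.
    if status in (200, 201):
        return "success"
    return "retry" if is_soft_error(status) else "errorfile"
```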


David Lang


[rsyslog] omelasticsearch - failed operation handling

2018-05-16 Thread Rich Megginson via rsyslog
In many cases, when adding a record to Elasticsearch, an HTTP status other 
than 200 or 201 does not necessarily indicate that the record cannot be 
added.  One case is bulk index rejection: the HTTP status for the record in 
the response is 429, and a short pause may be all that is needed before 
resubmitting the record.


omelasticsearch has support for an errorfile, but this requires the 
operator to examine the file and resubmit its contents manually.  The fluentd 
elasticsearch plugin recently got support for better error handling: 
https://github.com/richm/docs/blob/master/fluent-plugin-elasticsearch-retry.md


I would like to do the same with omelasticsearch. It would work 
something like this:


- a record that fails with a "soft" error would be sent to a "retry queue"

- a record that fails with a "hard" error would be sent to an "error queue"

In fluentd this is best accomplished with judicious use of tagging and 
labeling.


I think for rsyslog, this would best be handled by rulesets - a retry_es 
ruleset, and an error ruleset.


We have the original request in JSON string form, and the response JSON 
contains one response object for each request object, in exactly the same order.


pseudo code:

    if response is an error
        for each item in response
            get the corresponding request string
            convert request json string to json object
            MsgNew            - create a new Msg, set its fields from the request
            if is soft error
                MsgSetRuleset retry_es ruleset
            else
                MsgSetRuleset error ruleset
            ratelimitAddMsg   - submit the Msg for processing

Note that statuses 200 and 201 are success, and status 409 when using the 
"create" operation (I'm also working on adding support for this) indicates a 
duplicate and is considered successful.
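Putting those rules together, per-item handling might look like the following Python sketch. The pairing relies on the bulk response listing one item per request in submission order; the soft/hard split (429 soft, other 4xx hard) is an assumption for illustration:

```python
def item_succeeded(op_type, status):
    # 200/201 are success; 409 on "create" means the document already
    # exists, which counts as success when retrying with unique _ids.
    return status in (200, 201) or (op_type == "create" and status == 409)

def partition_bulk_response(requests, response_items):
    # The response carries one item per request, in the same order, so
    # zip() pairs each submitted request with its result.
    ok, retry, error = [], [], []
    for req, item in zip(requests, response_items):
        (op_type, result), = item.items()
        status = result["status"]
        if item_succeeded(op_type, status):
            ok.append(req)
        elif status == 429:        # soft error -> retry_es ruleset
            retry.append(req)
        else:                      # hard error -> error ruleset
            error.append(req)
    return ok, retry, error
```

The retry and error buckets correspond to the proposed retry_es and error rulesets.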


Without using a ruleset, I assume a message would be submitted at the 
"top" of the processing pipeline, and would require a lot of work to 
make that pipeline idempotent in most cases.


The config would look something like this:

ruleset(name="error_es") {
    action(type="omfile" ... write hard failures to error file ...)
}

ruleset(name="try_es") {
    action(type="omelasticsearch" retryRuleset="try_es" errorRuleset="error_es" ...)
}

... normal pipeline ...

call try_es # for both the normal case and the retry case
