Harkishen, thank you very much for the design document!

My initial thought is to agree with Stuart (as well as some users in the
linked GitHub issue) that it makes the most sense to start with dropping
data that is older than some configured age, with the default being to
never drop data. For most outage scenarios I think this is the easiest to
understand, and if there is an outage, retrying old data x times still does
not help you much.
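
To make that concrete, here is a minimal sketch of what the age check could
look like; the function name and the idea of a configurable max age are my
own illustration, not the actual queue manager code:

package main

import (
    "fmt"
    "time"
)

// dropExpired keeps only sample timestamps (in milliseconds) that are newer
// than now - maxAge. A zero maxAge means "never drop", matching the proposed
// default of keeping everything.
func dropExpired(sampleTimestamps []int64, maxAge time.Duration, now time.Time) []int64 {
    if maxAge == 0 {
        return sampleTimestamps
    }
    cutoff := now.Add(-maxAge).UnixMilli()
    kept := sampleTimestamps[:0]
    for _, ts := range sampleTimestamps {
        if ts >= cutoff {
            kept = append(kept, ts)
        }
    }
    return kept
}

func main() {
    now := time.Now()
    samples := []int64{
        now.Add(-2 * time.Hour).UnixMilli(),   // older than the limit, dropped
        now.Add(-5 * time.Minute).UnixMilli(), // recent, kept
    }
    fmt.Println(dropExpired(samples, time.Hour, now)) // prints only the recent timestamp
}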

There are a couple of use cases that an age-based solution doesn't solve
ideally:
1. An issue where bad data is causing the upstream system to break, e.g. I
have seen a system return a 5xx due to a null byte in a label value causing
some sort of panic. This blocks Prometheus from being able to process any
samples newer than that bad sample. Yes, this is an issue with the remote
storage, but it sucks when it happens and it would be nice to have an easy
workaround while a fix goes into the remote system. In this scenario, only
dropping old data still means you wouldn't be sending anything new for
quite a while, and if the bad data is persistent you would likely just end
up 10 minutes to an hour behind permanently (whatever you set the age to be).
2. Retrying 429 errors is a new feature currently behind a flag, but it
could make sense to retry 429s only a couple of times (if you want to retry
them at all) and then drop the data so that non-rate-limited requests can
proceed in the future (see the sketch just after this list).
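
For the 429 case, a rough sketch of what a bounded retry-then-drop decision
could look like; the retry limits here are hypothetical knobs, not the
current behaviour of the remote write retry code:

package main

import (
    "fmt"
    "net/http"
)

// retryDecision reports whether a failed remote write batch should be retried
// or dropped: keep retrying 5xx (subject to an age limit), but give up on 429
// after a configurable number of attempts so non-rate-limited requests can
// proceed.
func retryDecision(statusCode, attempts, max429Retries int) bool {
    switch {
    case statusCode == http.StatusTooManyRequests:
        return attempts < max429Retries
    case statusCode >= 500:
        return true
    default:
        return false // other 4xx: drop immediately
    }
}

func main() {
    fmt.Println(retryDecision(429, 1, 3))  // true: retry a rate-limited batch
    fmt.Println(retryDecision(429, 3, 3))  // false: drop after the retry budget is spent
    fmt.Println(retryDecision(500, 10, 3)) // true: 5xx keeps being retried
}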

I think, to start with, the above limitations are fine and the age-based
system is probably the way to go. I also wonder if it is worth defining a
more generic "retry_policies" section of remote write that could contain
different options for 5xx vs 429 responses.
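
To make the retry_policies idea a bit more concrete, here is one hypothetical
shape for such a section, written as Go config structs; the type names, field
names, and yaml tags are all invented for illustration, not an actual
Prometheus config:

package main

import (
    "fmt"
    "time"
)

// RetryPolicy is a hypothetical per-status-class policy.
type RetryPolicy struct {
    MaxRetries   int           `yaml:"max_retries,omitempty"`    // 0 = retry forever
    MaxSampleAge time.Duration `yaml:"max_sample_age,omitempty"` // 0 = never drop
}

// RetryPolicies would live under a remote_write entry and let 5xx and 429
// responses be handled differently.
type RetryPolicies struct {
    ServerError RetryPolicy `yaml:"server_error,omitempty"` // applied to 5xx responses
    RateLimited RetryPolicy `yaml:"rate_limited,omitempty"` // applied to 429 responses
}

func main() {
    policies := RetryPolicies{
        // Retry 5xx until the data is an hour old, then drop it.
        ServerError: RetryPolicy{MaxSampleAge: time.Hour},
        // Give up on a rate-limited batch after three attempts.
        RateLimited: RetryPolicy{MaxRetries: 3, MaxSampleAge: time.Hour},
    }
    fmt.Printf("%+v\n", policies)
}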

On Mon, Mar 1, 2021 at 3:32 AM Ben Kochie <sup...@gmail.com> wrote:

> If a remote write receiver is unable to ingest, wouldn't this be something
> to fix on the receiver side? The receiver could have a policy where it
> drops data rather than returning an error.
>
> This way Prometheus sends, but doesn't need to know or deal with
> ingestion policies. It sends a bit more data over the wire, but that part
> is cheap compared to the ingestion costs.
>

I certainly see the argument that this could all be cast as a receiver-side
issue, but I have also personally experienced outages that were much harder
to recover from due to a thundering herd scenario once the service was
restored, e.g. Cortex distributors (where an ingestion policy would be
implemented) effectively locking up or OOMing at a high enough request rate.
Also, an administrator may not be able to update whatever remote storage
solution they use. This becomes even more painful in a resource-constrained
environment. The only workaround right now is to restart all of your
Prometheus instances, which drops data indiscriminately; I would prefer to
be intentional about which data is dropped.

I would certainly be happy to jump on a call sometime with interested
parties if that would be more efficient :)

Chris
