Harkishen, thank you very much for the design document! My initial thought is to agree with Stuart (as well as some users in the linked GitHub issue) that it makes the most sense to start with dropping data that is older than some configured age, with the default being to never drop data. For most outage scenarios I think this is the easiest to understand, and if there is an outage, retrying old data x times still does not help you much.
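For concreteness, an age-based limit might look something like the sketch below. This is purely illustrative, not a proposed final API: the field name `sample_age_limit` and its placement under `queue_config` are my assumptions, not anything from the design document.

```yaml
# Hypothetical sketch only -- the field name and its placement are
# illustrative assumptions, not an agreed-upon API.
remote_write:
  - url: http://remote-storage.example.com/api/v1/write
    queue_config:
      # Drop samples older than this instead of retrying them.
      # A zero value (the default) would mean never drop data.
      sample_age_limit: 30m
```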
There are a couple of use cases that an age-based solution doesn't solve ideally:

1. An issue where bad data is causing the upstream system to break, e.g. I have seen a system return a 5xx due to a null byte in a label value causing some sort of panic. This blocks Prometheus from being able to process any samples newer than that bad sample. Yes, this is an issue with the remote storage, but it sucks when it happens and it would be nice to have an easy workaround while a fix goes into the remote system. In this scenario, only dropping old data still means you wouldn't be sending anything new for quite a while, and if the bad data is persistent you would likely just end up 10 minutes to an hour behind permanently (whatever you set the age to be).

2. Retrying 429 errors, a new feature currently behind a flag. It could make sense to only retry 429s a couple of times (if you want to retry them at all) but then drop the data so that non-rate-limited requests can proceed in the future.

I think to start with, the above limitations are fine and the age-based system is probably the way to go. I also wonder if it is worth defining a more generic "retry_policies" section of remote write that could contain different options for 5xx vs 429.

On Mon, Mar 1, 2021 at 3:32 AM Ben Kochie <sup...@gmail.com> wrote:

> If a remote write receiver is unable to ingest, wouldn't this be something
> to fix on the receiver side? The receiver could have a policy where it
> drops data rather than returning an error.
>
> This way Prometheus sends, but doesn't have to know or deal with
> ingestion policies. It sends a bit more data over the wire, but that part
> is cheap compared to the ingestion costs.

I certainly see the argument that this could all be cast as a receiver-side issue, but I have also personally experienced outages that were much harder to recover from due to a thundering-herd scenario once the service was restored, e.g. Cortex distributors (where an ingestion policy would be implemented) effectively locking up or OOMing at a high enough request rate. Also, an administrator may not be able to update whatever remote storage solution they use. This becomes even more painful in a resource-constrained environment. The solution right now is to restart all of your Prometheus instances to indiscriminately drop data; I would prefer to be intentional about what data is dropped.

I would certainly be happy to jump on a call sometime with interested parties if that would be more efficient :)

Chris

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/CANVFovVB_xBAXuLhZSBD8nh2LcH%3D-Svq1Ys_m99%3DL-jF3TP0Gg%40mail.gmail.com.