Re: [prometheus-developers] Remote-write drop samples | design doc

2021-03-22 Thread Stuart Clark

On 2021-03-22 11:47, Harkishen Singh wrote:

> Does that look good to go, or should we do just the age-based way?

The time-based approach isn't just about handling remote-write receivers 
that can only ingest samples up to a certain age; it also encapsulates 
policy about what data still matters.

Even if my receiver can ingest metrics with any timestamp, it is quite 
possible that I don't care about data older than a certain period. For 
example, I might be feeding an ML system used for auto-remediation: I 
want all the data, but after 30 minutes it becomes irrelevant. So even 
though the receiver would accept older data, I can set the limit to 30 
minutes and Prometheus just drops it instead of trying to resend 
(possibly unblocking more recent data in the process).
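
A minimal sketch of that check, with made-up names (this is not the 
actual remote-write queue code):

    package main

    import (
        "fmt"
        "time"
    )

    // shouldDrop reports whether a sample is older than the configured
    // limit and should be dropped instead of retried. A zero limit keeps
    // the current behaviour: never drop anything.
    func shouldDrop(sampleTime time.Time, maxAge time.Duration) bool {
        if maxAge == 0 {
            return false
        }
        return time.Since(sampleTime) > maxAge
    }

    func main() {
        maxAge := 30 * time.Minute // "irrelevant after 30 minutes"
        tooOld := time.Now().Add(-45 * time.Minute)
        fmt.Println(shouldDrop(tooOld, maxAge)) // true: drop rather than resend
    }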



--
Stuart Clark



Re: [prometheus-developers] Remote-write drop samples | design doc

2021-03-22 Thread Harkishen Singh
Thank you everyone for the suggestions!

I agree with the age-based solution, but it is particularly useful for 
systems that impose a limit on sample age, and many don't have one. Given 
that, can we have both? If users have a remote-storage system with age 
restrictions, they can use the time-based dropping logic; if their remote 
storage can accept a sample with any timestamp (past or future), they can 
use the retry-count method. The retry count also avoids recurring errors, 
like the null-byte case.

We could have something like a *LimitRetryPolicy* set to either *time* or 
*retries*. If it's *time*, we take the maximum age as input; if the policy 
is *retries*, a count is the input for the maximum number of retries. That 
way we solve both problems and leave the choice up to the user, based on 
the storage system they are using.

Does that look good to go, or should we do just the age-based way?

Thank you
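
To make that concrete, here is a rough Go sketch of the idea; the names 
(LimitRetryPolicy, QueueLimits, giveUp) are mine for illustration only, 
not a proposed API:

    package main

    import (
        "fmt"
        "time"
    )

    // LimitRetryPolicy selects how Prometheus decides to stop retrying.
    type LimitRetryPolicy string

    const (
        PolicyTime    LimitRetryPolicy = "time"    // drop samples older than MaxAge
        PolicyRetries LimitRetryPolicy = "retries" // drop after MaxRetries attempts
    )

    type QueueLimits struct {
        Policy     LimitRetryPolicy
        MaxAge     time.Duration // used when Policy == PolicyTime
        MaxRetries int           // used when Policy == PolicyRetries
    }

    // giveUp decides whether a batch should be dropped under either policy.
    func giveUp(l QueueLimits, oldestSample time.Time, attempts int) bool {
        switch l.Policy {
        case PolicyTime:
            return time.Since(oldestSample) > l.MaxAge
        case PolicyRetries:
            return attempts >= l.MaxRetries
        }
        return false // no policy set: keep retrying forever (lossless)
    }

    func main() {
        limits := QueueLimits{Policy: PolicyRetries, MaxRetries: 5}
        fmt.Println(giveUp(limits, time.Now(), 6)) // true: retry budget spent
    }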



Re: [prometheus-developers] Remote-write drop samples | design doc

2021-03-01 Thread Chris Marchbanks
Harkishen, thank you very much for the design document!

My initial thoughts are to agree with Stuart (as well as some users in the
linked GitHub issue) that it makes the most sense to start with dropping
data that is older than some configured age, with the default being to
never drop data. For most outage scenarios I think this is the easiest to
understand, and if there is an outage, retrying old data x times still
does not help you much.

There are a couple of use cases that an age-based solution doesn't solve
ideally:
1. An issue where bad data is causing the upstream system to break, e.g. I
have seen a system return a 5xx due to a null byte in a label value causing
some sort of panic. This blocks Prometheus from being able to process any
samples newer than that bad sample. Yes, this is an issue with the remote
storage, but it sucks when it happens, and it would be nice to have an easy
workaround while a fix goes into the remote system. In this scenario, only
dropping old data still means you wouldn't be sending anything new for
quite a while, and if the bad data is persistent you would likely just end
up permanently behind by whatever you set the age limit to, anywhere from
10 minutes to an hour.
2. Retrying 429 errors, a new feature currently behind a flag. It could
make sense to only retry 429s a couple of times (if you want to retry them
at all) and then drop the data so that non-rate-limited requests can
proceed in the future.

I think, to start with, the above limitations are fine and the age-based
system is probably the way to go. I also wonder if it is worth defining a
more generic "retry_policies" section of remote write that could contain
different options for 5xx vs 429.
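
As a sketch of the shape that could take, the Go structs backing such a
config block might look like this (all names are hypothetical, not the
real Prometheus schema):

    package main

    import (
        "fmt"
        "time"
    )

    // RetryPolicy holds per-error-class limits. Zero values mean
    // "no limit", preserving the lossless default.
    type RetryPolicy struct {
        MaxRetries int           `yaml:"max_retries,omitempty"`
        MaxAge     time.Duration `yaml:"max_age,omitempty"`
    }

    // RetryPolicies is a hypothetical "retry_policies" block of the
    // remote-write config, with different options for 5xx vs 429.
    type RetryPolicies struct {
        ServerError RetryPolicy `yaml:"5xx,omitempty"`
        RateLimited RetryPolicy `yaml:"429,omitempty"`
    }

    func main() {
        p := RetryPolicies{
            ServerError: RetryPolicy{MaxAge: time.Hour},   // retry 5xx for up to an hour
            RateLimited: RetryPolicy{MaxRetries: 2},       // retry 429s twice, then drop
        }
        fmt.Printf("%+v\n", p)
    }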

On Mon, Mar 1, 2021 at 3:32 AM Ben Kochie  wrote:

> If a remote write receiver is unable to ingest, wouldn't this be something
> to fix on the receiver side? The receiver could have a policy where it
> drops data rather than returning an error.
>
> This way Prometheus sends, but doesn't need to know about or deal with
> ingestion policies. It sends a bit more data over the wire, but that part
> is cheap compared to the ingestion costs.
>

I certainly see the argument that this could all be cast as a receiver-side
issue, but I have also personally experienced outages that were much harder
to recover from due to a thundering-herd scenario once the service was
restored, e.g. Cortex distributors (where an ingestion policy would be
implemented) effectively locking up or OOMing at a high enough request
rate. Also, an administrator may not be able to update whatever remote
storage solution they use. This becomes even more painful in a
resource-constrained environment. The solution right now is to restart all
of your Prometheus instances to indiscriminately drop data; I would prefer
to be intentional about what data is dropped.

I would certainly be happy to jump on a call sometime with interested
parties if that would be more efficient :)

Chris



Re: [prometheus-developers] Remote-write drop samples | design doc

2021-03-01 Thread Ben Kochie
If a remote write receiver is unable to ingest, wouldn't this be something
to fix on the receiver side? The receiver could have a policy where it
drops data rather than returning an error.

This way Prometheus sends, but doesn't need to know about or deal with
ingestion policies. It sends a bit more data over the wire, but that part
is cheap compared to the ingestion costs.
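
As an illustration, a receiver-side drop policy could be as simple as the
following sketch (not any real receiver's code; an age cut-off is just one
example of such a policy):

    package main

    import (
        "fmt"
        "time"
    )

    type sample struct {
        ts    time.Time
        value float64
    }

    // ingest sketches a receiver-side drop policy: samples the backend
    // cannot (or does not want to) store are discarded silently, and the
    // request still succeeds, so Prometheus never retries them.
    func ingest(samples []sample, maxAge time.Duration) (stored, dropped int) {
        for _, s := range samples {
            if time.Since(s.ts) > maxAge {
                dropped++ // too old for the backend: drop, don't error
                continue
            }
            stored++ // a real receiver would write to storage here
        }
        return stored, dropped
    }

    func main() {
        batch := []sample{
            {ts: time.Now(), value: 1},
            {ts: time.Now().Add(-2 * time.Hour), value: 2},
        }
        s, d := ingest(batch, time.Hour)
        fmt.Printf("stored=%d dropped=%d (request still returns 2xx)\n", s, d)
    }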



Re: [prometheus-developers] Remote-write drop samples | design doc

2021-03-01 Thread Harkishen Singh
Hey Stuart,

Thank you for your suggestion.

Yes, I think an age-based limit can be implemented as well. I think we 
should keep both a max-retry limit and an age limit. The age limit would 
help with remote storages that restrict sample age, while the retry limit 
would apply more generally (e.g. to storage systems without time 
restrictions) and would help in situations of heavy network congestion.



Re: [prometheus-developers] Remote-write drop samples | design doc

2021-03-01 Thread Stuart Clark

On 01/03/2021 07:25, Harkishen Singh wrote:

> ... if the retrying happens forever, then I don't think that is helpful 
> (it will never be accepted by the remote storage).

Under what situations would retries happen forever?

If the receiver is available but cannot accept the data (for example due 
to metric size limits or the age of the samples), I would expect it to 
reject with a 4xx code (permanent failure), which wouldn't trigger any 
retries.

Alternatively, if the receiver is unavailable or broken, that could 
result in "infinite" retries, but in that situation an age-based limit 
feels better than a retry limit: a short retry limit will drop samples 
that have just been scraped just as quickly as samples that are days old. 
An age-based limit also fits the use cases better. Some systems have 
restrictions on what age can be ingested (e.g. Timestream), and 
administrators could decide older data has no usefulness (e.g. if the 
receiver is used for alerting or anomaly detection). While the system 
should still reject such old samples once it is working again, a 
time-based limit would at least reduce the network impact once the 
receiver is back online (no need to send tons of data that we know will 
be rejected).
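
For reference, the retry semantics being assumed here look roughly like 
this (a simplified sketch, not the actual queue-manager code; retrying 
429s was, at the time of writing, a new flag-gated behaviour):

    package main

    import (
        "fmt"
        "net/http"
    )

    // retryable sketches remote-write retry semantics: 4xx responses are
    // permanent failures (never retried), 5xx are transient (retried with
    // backoff), and 429 is retried only when the opt-in flag is enabled.
    func retryable(statusCode int, retry429 bool) bool {
        switch {
        case statusCode >= 500:
            return true // transient server failure: retry
        case statusCode == http.StatusTooManyRequests:
            return retry429 // rate-limited: retry only if enabled
        default:
            return false // other 4xx: permanent, drop the batch
        }
    }

    func main() {
        fmt.Println(retryable(500, false)) // true
        fmt.Println(retryable(400, false)) // false
        fmt.Println(retryable(429, true))  // true
    }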


--
Stuart Clark



Re: [prometheus-developers] Remote-write drop samples | design doc

2021-02-28 Thread Harkishen Singh
Hi Tom,

I have tried to answer the comments; please let me know whether the 
answers are satisfactory. I am happy to get on a call if required (or if 
the discussion gets tough).

I think the lossless behaviour can be controlled by the user through the 
config (limit_retries), giving users more control over whether they are 
willing to compromise a bit when retrying goes on too long. If the 
retrying happens forever, I don't think that is helpful: the data will 
never be accepted by the remote storage. Also, as Chris mentioned, some 
users might prefer to have a few gaps and give more priority to recent 
data, for example for alerting. So I think this approach gives more 
flexibility to the user while remaining effectively optional (users can 
leave it off, or set the retry count high enough).

(Apologies if I am wrong somewhere.)

Thank you
Harkishen Singh



Re: [prometheus-developers] Remote-write drop samples | design doc

2021-02-27 Thread Tom Wilkie
Hi Harkishen! Thank you for the doc - I'm really excited to see more
interest in Prometheus remote write.

We can go back and forth on the doc with comments, but perhaps it would be
easier to have a chat over VC with Chris and me? My main concerns are that
we preserve the lossless nature of remote write, and I worry that limiting
the number of retries on 500s will undermine this.

Cheers

Tom

On Fri, Feb 26, 2021 at 4:56 PM Harkishen Singh wrote:

> Hello everyone,
>
> I had started to work on #7912 and have written a design doc for it.
> Please give your suggestions/feedback/improvements by commenting on the
> doc below.
>
> Design doc link: click here
>
> Thank you
>
> Harkishen Singh
