Re: [prometheus-developers] Re: Remote Write Metadata propagation

2020-08-06 Thread Bjoern Rabenstein
On 03.08.20 03:04, Rob Skillington wrote:
> Ok - I have a proposal which could be broken up into two pieces, first
> delivering TYPE per datapoint, the second consistently and reliably HELP and
> UNIT once per unique metric name:
> https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo
> /edit#heading=h.bik9uwphqy3g

Thanks for the doc. I have commented on it, but while doing so, I felt
the urge to comment more generally, which would not fit well into the
margin of a Google doc. My thoughts are also a bit out of scope of
Rob's design doc and more about the general topic of remote write and
the equally general topic of metadata (about which we have an ongoing
discussion among the Prometheus developers).

Disclaimer: I don't know the remote-write protocol very well. My hope
here is that my somewhat distant perspective is of some value as it
allows me to take a step back. However, I might just miss crucial details
that completely invalidate my thoughts. We'll see...

I do care a lot about metadata, though. (And ironically, the reason
why I declared remote write "somebody else's problem" is that I've
always disliked how it fundamentally ignores metadata.)

Rob's document embraces the fact that metadata can change over time,
but it assumes that at any given time, there is only one set of
metadata per unique metric name. It takes into account that there can
be drift, but it considers drift an irregularity that will only happen
occasionally and iron itself out over time.

In practice, however, metadata can be legitimately and deliberately
different for different time series of the same name. Instrumentation
libraries and even the exposition format inherently require one set of
metadata per metric name, but this is all only enforced (and meant to
be enforced) _per target_. Once the samples are ingested (or even sent
onwards via remote write), they have no notion of what target they
came from. Furthermore, samples created by rule evaluation don't have
an originating target in the first place. (Which raises the question
of metadata for recording rules, which is another can of worms I'd
like to open eventually...)

(There is also the technical difficulty that the WAL has no notion of
bundling or referencing all the series with the same metric name. That
was commented about in the doc but is not my focus here.)

Rob's doc sees TYPE as special because it is so cheap to just add to
every data point. That's correct, but it's giving me an itch: Should
we really create different ways of handling metadata, depending on its
expected size?

Compare this with labels. There is no upper limit to their number or
size. Still, we have no plan of treating "large" labels differently
from "short" labels.

On top of that, we have by now gained the insight that metadata is
changing over time and essentially has to be tracked per series.

Or in other words: From a pure storage perspective, metadata behaves
exactly the same as labels! (There are certainly huge differences
semantically, but those only manifest themselves on the query level,
i.e. how you treat it in PromQL etc.)

(This is not exactly a new insight. This is more or less what I said
during the 2016 dev summit, when we first discussed remote write. But
I don't want to dwell on "told you so" moments... :o)

There is a good reason why we don't just add metadata as "pseudo
labels": As discussed a lot in the various design docs including Rob's
one, it would blow up the data size significantly because HELP strings
tend to be relatively long.
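
Just to make the "pseudo labels" idea concrete, here is a sketch of what it
would look like if we did it naively. The __type__/__help__/__unit__ label
names are invented for this illustration and are not part of any exposition
or remote-write format:

package main

import "fmt"

func main() {
	series := map[string]string{
		"__name__": "http_request_duration_seconds",
		"job":      "api",
		"instance": "10.0.0.1:8080",
		// Structurally, metadata would just be more label pairs...
		"__type__": "histogram",
		"__unit__": "seconds",
		// ...but the HELP string would be repeated on every series of the
		// metric, which is where the payload size blows up.
		"__help__": "A histogram of request latencies across all HTTP handlers.",
	}
	fmt.Println(len(series), "label pairs on this one series")
}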

And that's the point where I would like to take a step back: We are
discussing essentially treating something that is structurally the
same thing in three different ways: Way 1 for labels as we know
them. Way 2 for "small" metadata. Way 3 for "big" metadata.

However, while labels tend to be shorter than HELP strings, there is
the occasional use case with long or many labels. (Infamously, at
SoundCloud, a binary accidentally put a whole HTML page into a
label. That wasn't a use case, it was a bug, but the Prometheus server
ingesting that was just chugging along as if nothing special had
happened. It looked weird in the expression browser, though...) I'm
sure any vendor offering Prometheus remote storage as a service will
have a customer or two that use excessively long label names. If we
have to deal with that, why not bite the bullet and treat metadata in
the same way as labels in general? Or to phrase it in another way: Any
solution for "big" metadata could be used for labels, too, to
alleviate the pain with excessively long label names.

Or most succinctly: A robust and really good solution for
"big" metadata in remote write will make remote write much more
efficient if applied to labels, too.
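
To illustrate - purely hypothetically - what such a solution could look
like: a per-request string table that both labels and metadata reference by
index. The types and field names below are invented for this sketch and are
not a proposed wire format:

package main

import "fmt"

// SymbolizedSeries refers into Request.Symbols by index instead of carrying
// its own copies of the strings.
type SymbolizedSeries struct {
	LabelRefs []uint32 // alternating label name/value indices
	TypeRef   uint32
	HelpRef   uint32
	UnitRef   uint32
}

// Request is one remote-write payload with a single deduplicated string table.
type Request struct {
	Symbols []string
	Series  []SymbolizedSeries
}

// intern returns the index of s in the table, appending it if it is new.
func intern(syms *[]string, idx map[string]uint32, s string) uint32 {
	if i, ok := idx[s]; ok {
		return i
	}
	i := uint32(len(*syms))
	*syms = append(*syms, s)
	idx[s] = i
	return i
}

func main() {
	req := Request{}
	idx := map[string]uint32{}
	help := intern(&req.Symbols, idx, "Total number of HTTP requests.")
	for _, job := range []string{"api", "worker"} {
		req.Series = append(req.Series, SymbolizedSeries{
			LabelRefs: []uint32{
				intern(&req.Symbols, idx, "__name__"),
				intern(&req.Symbols, idx, "http_requests_total"),
				intern(&req.Symbols, idx, "job"),
				intern(&req.Symbols, idx, job),
			},
			TypeRef: intern(&req.Symbols, idx, "counter"),
			HelpRef: help, // the long string is referenced, not repeated
			UnitRef: intern(&req.Symbols, idx, ""),
		})
	}
	fmt.Println(len(req.Symbols), "unique strings for", len(req.Series), "series")
}

A long HELP string or an excessively long label value then costs the payload
only once per request, no matter how many series share it.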

Imagine an NALSD tech interview question that boils down to "design
Prometheus remote write". I bet that most of the better candidates
will recognize that most of the payload will consist of series
identifiers (call them labels or whate

Re: [prometheus-developers] Re: Remote Write Metadata propagation

2020-08-06 Thread Callum Styan
Thanks Rob for putting this proposal together, I think it highlights some
features of what we want metadata in RW, and remote write in general, to
look like in the future. As others have pointed out (thanks Björn for giving
such a detailed description) there are issues with the way Prometheus
currently handles metadata that need to be thought about and handled
differently when storing metadata in the WAL or in long term storage. I
didn't make many more comments as most of what I wanted to say had already
been mentioned by others.

As part of thinking about how to get metadata and exemplars into remote
write, some of us have been discussing what we've been calling 'the future
of remote write'. While there's nothing formal yet, I will be starting a
brainstorming/design doc soon and would appreciate your input there Rob.



Re: [prometheus-developers] Re: Remote Write Metadata propagation

2020-08-06 Thread Rob Skillington
Hey Björn,


Thanks for the detailed response. I've had a few back and forths on this
with
Brian and Chris over IRC and CNCF Slack now too.

I agree that it is fundamentally naive to model this idealistically around
per metric name. It needs to be per series, given what may happen w.r.t.
collisions across targets, etc.

Perhaps we can separate these discussions into two considerations:

1) Modeling of the data such that it is kept around for transmission
(primarily we're focused on the WAL here).

2) Transmission (which, as you allude to, has many areas for improvement).

For (1) - it seems like this needs to be done per time series. Thankfully
we have already modeled this so that per-series data is stored just once in
a single WAL file. I will write up my proposal here, but it will amount to
encoding the HELP, UNIT and TYPE into the WAL per series, similar to how
labels for a series are encoded once per series in the WAL. Since this
optimization is in place, there's already a huge dampening effect on how
expensive it is to write out data about a series (e.g. labels). We can
always collect a sample WAL file and measure how much extra size HELP, UNIT
and TYPE would add, but it seems like it won't fundamentally change the
order of magnitude of "information about a timeseries storage size" vs
"datapoints about a timeseries storage size". One extra change would be
re-encoding the series into the WAL if the HELP changed for that series,
just so that when HELP does change it is up to date from the view of
whoever is reading the WAL (i.e. the Remote Write loop). Since this entry
needs to be loaded into memory for Remote Write today anyway, with string
interning as suggested by Chris it won't algorithmically change the memory
profile of a Prometheus with Remote Write enabled. There will be some
overhead, at most likely similar to the label data, but we aren't altering
data structures (so we won't change the big-O magnitude of memory used);
we're adding fields to existing data structures, and string interning
should make this much less onerous since HELP is highly duplicative across
time series.
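
To sketch what I mean for (1) - these are purely illustrative types, not the
actual Prometheus WAL record structs or intern package API:

package main

import "fmt"

// SeriesRecord stands in for the per-series entry written once to the WAL
// (today: series ref + labels), grown by three metadata fields. It would be
// re-encoded only if the metadata for that series changes (e.g. a new HELP).
type SeriesRecord struct {
	Ref    uint64            // series reference used by sample records
	Labels map[string]string // already encoded once per series today
	Type   string            // e.g. "counter"
	Unit   string            // e.g. "seconds"
	Help   string            // interned: many series share the same string
}

// helpPool hands out a single shared copy of each distinct HELP string, so
// thousands of series from the same metric family don't each hold their own.
var helpPool = map[string]string{}

func internHelp(s string) string {
	if v, ok := helpPool[s]; ok {
		return v
	}
	helpPool[s] = s
	return s
}

func main() {
	rec := SeriesRecord{
		Ref:    1,
		Labels: map[string]string{"__name__": "http_requests_total", "job": "api"},
		Type:   "counter",
		Help:   internHelp("Total number of HTTP requests."),
	}
	fmt.Printf("%+v\n", rec)
}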

For (2) - now we have basically TYPE, HELP and UNIT all available for
transmission if we wanted to send them with every single datapoint. While I
think we should definitely examine HPACK-like compression features, as you
mentioned Björn, I think we should separate that kind of work into a
Milestone 2 where it is considered. For the time being it's very plausible
we could do some negotiation with the receiving Remote Write endpoint by
sending a "GET" to the remote write endpoint and seeing if it responds with
a "capabilities + preferences" response: the endpoint could specify that it
would like to receive metadata all the time on every single request (and
let Snappy take care of keeping the size from ballooning too much), or that
it would like TYPE on every single datapoint, and HELP and UNIT every
DESIRED_SECONDS or so. To enable a "send HELP every 10 minutes" feature we
would have to add a "last sent" timestamp to the data structure that holds
the LABELS, TYPE, HELP and UNIT for each series, so we know when to resend
to that backend, but that seems entirely plausible and would not use more
than 4 extra bytes per series.
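
A rough sketch of that "last sent" idea - the Capabilities struct and its
fields are invented here for illustration, not an agreed format:

package main

import (
	"fmt"
	"time"
)

// Capabilities is what a receiver might return to a GET on the remote write
// endpoint to state its preferences.
type Capabilities struct {
	MetadataOnEverySample  bool
	MetadataResendInterval time.Duration // e.g. 10 * time.Minute
}

// seriesState is the extra per-series bookkeeping on the sender side.
type seriesState struct {
	metadataLastSent time.Time // the few extra bytes per series mentioned above
}

// needsMetadata reports whether HELP/UNIT should ride along on the next write
// request for this series, given the receiver's stated preferences.
func (s *seriesState) needsMetadata(caps Capabilities, now time.Time) bool {
	if caps.MetadataOnEverySample {
		return true
	}
	if now.Sub(s.metadataLastSent) >= caps.MetadataResendInterval {
		s.metadataLastSent = now
		return true
	}
	return false
}

func main() {
	caps := Capabilities{MetadataResendInterval: 10 * time.Minute}
	st := &seriesState{}
	now := time.Now()
	fmt.Println(st.needsMetadata(caps, now))                  // true: never sent before
	fmt.Println(st.needsMetadata(caps, now.Add(time.Minute))) // false: resent only after the interval
}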

These thoughts are based on the discussions I've had and the thoughts on
this thread. What's the feedback on this before I go ahead and iterate on
the design to more closely match what I'm suggesting here?

Best,
Rob


Re: [prometheus-developers] Re: Remote Write Metadata propagation

2020-08-06 Thread Rob Skillington
Hey Callum,

Apologies, I missed your response as I was typing back to Björn.

Looking forward to seeing your document, sounds good. As I mentioned in my
previous email, I think there's definitely a "further work" area here.
However, I'd like to get at least TYPE (and, if it's not too difficult,
HELP and UNIT too) flowing sooner than that timeline, and we have folks
ready to contribute to work in this space right now.

Would love to hear your thoughts on my latest proposal as sent with the
last
email.

Best,
Rob
