Re: [prometheus-users] Re: One Prometheus per Observed System vs. One Prometheus for Everything

2020-05-06 Thread Stuart Clark

On 06/05/2020 20:21, Shay Berman wrote:

Hi Stuart

Agree with your points.

about this section:
/"The options which run queries on the "local" Prometheus servers require
those services to be available and not too busy - you can have the
situation that a query from somewhere else breaks a server because it is
too big/too slow. Equally a server being unavailable (down/network
issues) will cause a query to fail."/

You didn't mention promxy or Thanos Query - these could help to 
avoid failing the whole query if a single Prometheus instance is not 
responding.




It could help (or hinder) depending on the failure mode & query purpose.

If you are running a query across multiple sharded servers (e.g. 
different environments), Thanos/promxy isn't going to help with the 
missing data. However, if you have HA pairs of servers everywhere it can 
be very useful when a single server has issues.
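
For that HA case, the deduplication in Thanos Query or promxy relies on 
each replica carrying a distinguishing external label. A minimal sketch - 
the label names and values here are only examples, not anything your setup 
has to use:

# prometheus.yml on the first replica of an HA pair (sketch only)
global:
  external_labels:
    cluster: prod-eu   # example shard/environment identifier
    replica: A         # the second replica would set replica: B

Thanos Query is then told which label identifies replicas so it can merge 
results from the pair (as far as I know promxy achieves the same by putting 
the pair into one server group), so one unhealthy replica doesn't fail the 
whole query.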


If you have queries which stress a server (either due to the number of 
timeseries covered or just overall query volume), systems which duplicate 
queries could in certain situations make things worse - maybe every 
server is now overloaded.


As I say, the exact "best option" very much depends on your particular 
situation. Is it a single environment in one location, or lots of 
environments globally? Do you have a single easily defined set of users 
(dashboards/alerts) or lots of different teams with different needs & 
requirements (e.g. some needing longer term querying for capacity 
management, while others are just short term incident management)? Does 
the way you operate fit into a more hierarchical structure/process (e.g. 
region -> environment -> service -> instance) or are things more "flat"?


--
Stuart Clark



Re: [prometheus-users] Re: One Prometheus per Observed System vs. One Prometheus for Everything

2020-05-06 Thread Shay Berman
Hi Stuart

Agree with your points.

about this section:




*"The options which run queries on the "local" Prometheus servers 
requirethose services to be available and not too busy - you can have 
thesituation that a query from somewhere else breaks a server because it 
istoo big/too slow. Equally a server being unavailable (down/networkissues) 
will cause a query to fail."*

You didn't mention promxy or Thanos Query - these could help to avoid 
failing the whole query if a single Prometheus instance is not 
responding.

On Wednesday, May 6, 2020 at 11:34:24 AM UTC+3, Stuart Clark wrote:
>
> On 2020-05-06 08:48, Shay Berman wrote: 
> > Actually I am facing the same situation [1] when dealing with millions 
> > of time series on a single Prometheus. 
> > So I am trying to break it down into smaller Prometheus instances (each 
> > scraping a range of targets). 
> > But then a global view comes in (because you don't want to break your 
> > existing dashboards that have queries against a specific datasource), so 
> > there are a few solutions here: 
> > 1. Thanos Querier link1 [2] and link2 [3] - which can also give you 
> > long-term storage as an optional phase. 
> > 2. promxy [4], which is probably lighter (no sidecar needed inside the 
> > Prometheus pod) but has fewer features, such as long-term storage and 
> > deduplication. 
> > 3. A global Prometheus that just does remote_read [5] from all the 
> > smaller Prometheus servers. It works but is not really well documented. 
> > I believe #1 and #2 are better. 
> > 4. Prometheus federation [6] - but this has a scaling limitation since 
> > you must scrape only a subset of the data, and it may blow up if you 
> > have many small Prometheus servers. 
>
> There are lots of details around how you operate/want to work that may 
> help with deciding which method works for you. 
>
> The options which run queries on the "local" Prometheus servers require 
> those services to be available and not too busy - you can have the 
> situation that a query from somewhere else breaks a server because it is 
> too big/too slow. Equally a server being unavailable (down/network 
> issues) will cause a query to fail. 
>
> Federation removes that limitation, as the "global" queries would only 
> be handled by the one Prometheus server, with the only load on the 
> "local" servers being the constant federation requests (which should be 
> small and predictable). However, as you mention, switching to federation 
> needs careful design. You would want recording rules in the "local" 
> servers to aggregate the metrics (e.g. removing instance labels using 
> sum()) and then match[] selectors that only federate just enough for the 
> global alerts & dashboards. You may want to split your dashboards to 
> local & global - local would sit with/query the local servers, and can 
> give detail (because they are querying the full data), but may have 
> availability issues & can't query data not held on that server; global 
> would use the federated data, but cannot give the full per-instance 
> detail. 
>
> The global storage solutions sit somewhere in the middle. They have the 
> advantage of not being dependent on local servers for queries. They 
> equally can store everything, rather than just summaries. However there 
> is some complexity, and just because you can store everything centrally 
> & query without having recording rules to aggregate doesn't mean you 
> always should - queries will be slow if lots of series/blocks have to be 
> interrogated. 
>
>
> > 
> > Just sharing my 2 cents so far. 
> > Shay 
> > 
> > On Tuesday, May 5, 2020 at 2:05:20 PM UTC+3, Tim Schwenke wrote: 
> > 
> >> Hello, 
> >> 
> >> did I understand correctly that due to Prometheus being very 
> >> light-weight (unlike an Elasticsearch) and efficient but having an 
> >> upper-limit of xx millions of time series per instance it is 
> >> recommended to have one Prometheus server/container per observed 
> >> system (may it be an application or a set of CI/CD job runners) rather 
> >> than to host a single massive Prometheus? 
> >> 
> >> On one hand I see the advantage in scraping all my REST APIs across 
> >> all apps with one Prometheus. On the other hand I also have a ton of 
> >> application specific metrics I would have to separate with a prefix 
> >> or so to not lose overview (labels work as well, but I have to pick 
> >> a metric first to filter for a certain label value). 
> >> 
> >> Thanks in advance, 
> >> 
> >> Tim Schwenke 
> > 

Re: [prometheus-users] Re: One Prometheus per Observed System vs. One Prometheus for Everything

2020-05-06 Thread Stuart Clark

On 2020-05-06 08:48, Shay Berman wrote:

Actually I am facing the same situation [1] when dealing with millions
of time series on a single Prometheus.
So I am trying to break it down into smaller Prometheus instances (each
scraping a range of targets).
But then a global view comes in (because you don't want to break your
existing dashboards that have queries against a specific datasource), so
there are a few solutions here:
1. Thanos Querier link1 [2] and link2 [3] - which can also give you
long-term storage as an optional phase.
2. promxy [4], which is probably lighter (no sidecar needed inside the
Prometheus pod) but has fewer features, such as long-term storage and
deduplication.
3. A global Prometheus that just does remote_read [5] from all the
smaller Prometheus servers. It works but is not really well documented. I
believe #1 and #2 are better.
4. Prometheus federation [6] - but this has a scaling limitation since
you must scrape only a subset of the data, and it may blow up if you have
many small Prometheus servers.


There are lots of details around how you operate/want to work that may 
help with deciding which method works for you.


The options which run queries on the "local" Prometheus servers require 
those services to be available and not too busy - you can have the 
situation that a query from somewhere else breaks a server because it is 
too big/too slow. Equally a server being unavailable (down/network 
issues) will cause a query to fail.
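
(As an aside, option 3 from your list is just a remote_read section on the 
"global" server pointing at each of the smaller ones - roughly along these 
lines, with placeholder hostnames:

remote_read:
  - url: http://prometheus-shard-a:9090/api/v1/read
    read_recent: true   # read the remote even for recent ranges, as the global server holds no data of its own
  - url: http://prometheus-shard-b:9090/api/v1/read
    read_recent: true

Every query against the global server then fans out to those endpoints, 
which is exactly why it inherits the availability and load concerns above.)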


Federation removes that limitation, as the "global" queries would only 
be handled by the one Prometheus server, with the only load on the 
"local" servers being the constant federation requests (which should be 
small and predictable). However, as you mention, switching to federation 
needs careful design. You would want recording rules in the "local" 
servers to aggregate the metrics (e.g. removing instance labels using 
sum()) and then match[] selectors that only federate just enough for the 
global alerts & dashboards. You may want to split your dashboards to 
local & global - local would sit with/query the local servers, and can 
give detail (because they are querying the full data), but may have 
availability issues & can't query data not held on that server; global 
would use the federated data, but cannot give the full per-instance 
detail.
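
To make that concrete - and this is only a sketch, with made-up metric and 
rule names - a "local" server might aggregate away the instance label with 
a recording rule:

groups:
  - name: aggregate_for_federation
    rules:
      - record: job:http_requests:rate5m
        expr: sum without (instance) (rate(http_requests_total[5m]))

and the "global" server would then federate only those aggregated series:

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # pull only the job-level recording rules
    static_configs:
      - targets:
          - 'prometheus-local-1:9090'
          - 'prometheus-local-2:9090'

That keeps the federation scrape small and predictable, at the cost of the 
global side only ever seeing the pre-aggregated view.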


The global storage solutions sit somewhere in the middle. They have the 
advantage of not being dependent on local servers for queries. They 
equally can store everything, rather than just summaries. However there 
is some complexity, and just because you can store everything centrally 
& query without having recording rules to aggregate doesn't mean you 
always should - queries will be slow if lots of series/blocks have to be 
interrogated.





Just sharing my 2 cents so far.
Shay

On Tuesday, May 5, 2020 at 2:05:20 PM UTC+3, Tim Schwenke wrote:


Hello,

did I understand correctly that due to Prometheus being very
light-weight (unlike an Elasticsearch) and efficient but having an
upper-limit of xx millions of time series per instance it is
recommended to have one Prometheus server/container per observed
system (may it be an application or a set of CI/CD job runners) rather
than to host a single massive Prometheus?

On one hand I see the advantage in scraping all my REST APIs across
all apps with one Prometheus. On the other hand I also have a ton of
application specific metrics I would have to separate with a prefix
or so to not lose overview (labels work as well, but I have to pick
a metric first to filter for a certain label value).

Thanks in advance,

Tim Schwenke





Links:
--
[1] https://groups.google.com/forum/#!topic/prometheus-users/CwDl2uOSRVY
[2] https://github.com/thanos-io/thanos/blob/master/docs/components/query.md
[3] https://www.youtube.com/watch?v=Iuo1EjCN5i4
[4] https://github.com/jacksontj/promxy
[5] https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_read