Re: [prometheus-users] Re: One Prometheus per Observed System vs. One Prometheus for Everything
On 06/05/2020 20:21, Shay Berman wrote:
> Hi Stuart,
>
> I agree with your points. About this section:
>
> "The options which run queries on the "local" Prometheus servers require those services to be available and not too busy - you can have the situation that a query from somewhere else breaks a server because it is too big/too slow. Equally a server being unavailable (down/network issues) will cause a query to fail."
>
> You didn't mention promxy or Thanos Query - these could help to avoid failing the whole query if a single Prometheus instance is not responding.

It could help (or hinder) depending on the failure mode & query purpose.

If you are running a query across multiple sharded servers (e.g. different environments), Thanos/promxy isn't going to help with the missing data. However, if you have HA pairs of servers everywhere, it can be very useful when a single server has issues.

If you have queries which stress a server (either due to the number of time series covered or just overall query volume), systems which duplicate queries could in certain situations make things worse - maybe every server is now overloaded.

As I say, the exact "best option" very much depends on your particular situation. Is it a single environment in one location, or lots of environments globally? Do you have a single easily defined set of users (dashboards/alerts), or lots of different teams with different needs & requirements (e.g. some needing longer-term querying for capacity management, while others only need short-term incident management)? Does the way you operate fit into a more hierarchical structure/process (e.g. region -> environment -> service -> instance), or are things more "flat"?

--
Stuart Clark
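The difference between sharded servers and HA pairs described in this reply can be sketched with a small simulation. This is only an illustration under stated assumptions: `make_server` and `fan_out` are hypothetical names, the servers are in-memory stubs, and neither Thanos nor promxy is implemented this way. It shows a fan-out query surviving a failed replica in an HA pair, losing the data of a shard whose only server is down, and sending the query to every replica.

```python
def make_server(series, up=True):
    """Return a hypothetical server stub: answers a query, or raises if it is down."""
    def server(query):
        if not up:
            raise ConnectionError("server unavailable")
        return dict(series)
    return server


def fan_out(shards, query):
    """Send the query to every replica of every shard and merge the answers.

    Returns (merged series, shards with no answer, number of queries sent).
    """
    merged, missing, calls = {}, [], 0
    for name, replicas in shards.items():
        answered = False
        for replica in replicas:
            calls += 1  # every replica receives the query, healthy or not
            try:
                # Identical series from HA replicas collapse into one entry.
                merged.update(replica(query))
                answered = True
            except ConnectionError:
                pass
        if not answered:
            missing.append(name)  # no replica answered: this data is simply absent
    return merged, missing, calls


shards = {
    # HA pair with one replica down: the healthy one still answers.
    "prod": [
        make_server({'up{env="prod"}': 1.0}),
        make_server({'up{env="prod"}': 1.0}, up=False),
    ],
    # Single server that is down: nothing can fill the gap.
    "staging": [make_server({'up{env="staging"}': 1.0}, up=False)],
}

merged, missing, calls = fan_out(shards, "up")
print(merged, missing, calls)
```

The call count also hints at the load concern: each query is multiplied by the number of replicas it is sent to.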
Re: [prometheus-users] Re: One Prometheus per Observed System vs. One Prometheus for Everything
Hi Stuart,

I agree with your points. About this section:

"The options which run queries on the "local" Prometheus servers require those services to be available and not too busy - you can have the situation that a query from somewhere else breaks a server because it is too big/too slow. Equally a server being unavailable (down/network issues) will cause a query to fail."

You didn't mention promxy or Thanos Query - these could help to avoid failing the whole query if a single Prometheus instance is not responding.
Re: [prometheus-users] Re: One Prometheus per Observed System vs. One Prometheus for Everything
On 2020-05-06 08:48, Shay Berman wrote:
> Actually, I am facing the same situation [1] when dealing with millions of time series on a single Prometheus.
> So I am trying to break it down into smaller Prometheus instances (each scraping a range of targets).
> But then the need for a global view comes in (because you don't want to break your existing dashboards that query a specific datasource), so there are a few solutions here:
> 1. Thanos Querier, link1 [2] and link2 [3] - which can also give you long-term storage as an optional phase.
> 2. promxy [4], which is probably lighter (no need for a sidecar inside the Prometheus pod) but has fewer features, such as long-term storage and deduplication.
> 3. A global Prometheus that just does remote_read [5] from all the smaller Prometheus servers. It works, but it is not really well documented. I believe #1 and #2 are better.
> 4. Prometheus federation [6] - but this has a scaling limitation, since you must scrape only a subset of the data; it may blow up if you have many small Prometheus servers.

There are lots of details around how you operate/want to work that may help with deciding which method works for you.

The options which run queries on the "local" Prometheus servers require those services to be available and not too busy - you can have the situation that a query from somewhere else breaks a server because it is too big/too slow. Equally a server being unavailable (down/network issues) will cause a query to fail.

Federation removes that limitation, as the "global" queries would only be handled by the one Prometheus server, with the only load on the "local" servers being the constant federation requests (which should be small and predictable). However, as you mention, switching to federation needs careful design. You would want recording rules in the "local" servers to aggregate the metrics (e.g. removing instance labels using sum()) and then match[] selectors that federate just enough for the global alerts & dashboards.
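As a rough illustration of the federation design described here, the sketch below mimics the two pieces in plain Python: an aggregation that drops the instance label (what a recording rule using sum() would do on a "local" server), and a request URL for Prometheus's /federate endpoint carrying a match[] selector. The sample data, the host name, the helper names `sum_without` and `federate_url`, and the `job:` rule-name prefix are assumptions made for the example, not taken from the thread.

```python
from collections import defaultdict
from urllib.parse import urlencode


def sum_without(samples, drop_label):
    """Sum values after removing one label, similar in spirit to sum() without that label."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The remaining labels identify the aggregated series.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        totals[key] += value
    return dict(totals)


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url}/federate?{query}"


# Hypothetical per-instance samples held on a "local" server.
samples = [
    ({"job": "api", "instance": "10.0.0.1:9100"}, 3.0),
    ({"job": "api", "instance": "10.0.0.2:9100"}, 5.0),
    ({"job": "db", "instance": "10.0.0.3:9100"}, 2.0),
]
aggregated = sum_without(samples, "instance")

# Federate only the aggregated series, assuming rule names start with "job:".
url = federate_url("http://prometheus-local.example:9090", ['{__name__=~"job:.*"}'])

print(aggregated)
print(url)
```

Three per-instance series become two per-job series, and only those would cross to the global server, which is why the federation load stays small and predictable.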
You may want to split your dashboards into local & global - local would sit with/query the local servers and can give detail (because they are querying the full data), but may have availability issues & can't query data not held on that server; global would use the federated data, but cannot give the full per-instance detail.

The global storage solutions sit somewhere in the middle. They have the advantage of not being dependent on local servers for queries. They can equally store everything, rather than just summaries. However, there is some complexity, and just because you can store everything centrally & query without having recording rules to aggregate doesn't mean you always should - queries will be slow if lots of series/blocks have to be interrogated.

> Just sharing my 2 cents so far.
> Shay
>
> On Tuesday, May 5, 2020 at 2:05:20 PM UTC+3, Tim Schwenke wrote:
>> Hello,
>>
>> did I understand correctly that, because Prometheus is very lightweight (unlike Elasticsearch) and efficient but has an upper limit of xx million time series per instance, it is recommended to have one Prometheus server/container per observed system (be it an application or a set of CI/CD job runners) rather than to host a single massive Prometheus?
>>
>> On the one hand, I see the advantage in scraping all my REST APIs across all apps with one Prometheus. On the other hand, I also have a ton of application-specific metrics that I would have to separate with a prefix or similar to not lose the overview (labels work as well, but I have to pick a metric first to filter for a certain label value).
>>
>> Thanks in advance,
>>
>> Tim Schwenke
Links:
[1] https://groups.google.com/forum/#!topic/prometheus-users/CwDl2uOSRVY
[2] https://github.com/thanos-io/thanos/blob/master/docs/components/query.md
[3] https://www.youtube.com/watch?v=Iuo1EjCN5i4
[4] https://github.com/jacksontj/promxy
[5] https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_read