[jira] [Updated] (SOLR-15059) Default Grafana dashboard needs to expose graphs for monitoring query performance

Timothy Potter (Jira) Wed, 23 Dec 2020 09:19:05 -0800


     [ 
https://issues.apache.org/jira/browse/SOLR-15059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Timothy Potter updated SOLR-15059:
----------------------------------
    Description: 
The default Grafana dashboard doesn't expose graphs for monitoring query 
performance. For instance, if I want to see QPS for a collection, that's not 
shown in the default dashboard. Same for quantiles like p95 query latency.

After some digging, I found that these metrics are available in the output 
from {{/admin/metrics}} but are not exported by the exporter.

This PR proposes to enhance the default dashboard with a new Query Metrics 
section with the following metrics:
* Distributed QPS per Collection (aggregated across all cores)
* Distributed QPS per Solr Node (aggregated across all base_url)
* QPS 1-min rate per core
* QPS 5-min rate per core
* Top-level Query latency p99, p95, p75
* Local (non-distrib) query count per core (this is important for determining 
if there is unbalanced load)
* Local (non-distrib) query rate per core (1-min)
* Local (non-distrib) p95 per core
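
As a rough illustration of the "aggregated across" wording in the list above, the per-collection and per-node QPS figures amount to summing per-core rates grouped by a label. The sketch below is not how the dashboard computes them (that would be done in the Grafana panel queries); the sample values, host names, and the {{sum_by}} helper are made up for illustration.

```python
from collections import defaultdict

# Made-up per-core samples: (collection, base_url, core, 1-min distributed QPS).
# Real values would come from the metrics scraped by the exporter.
SAMPLES = [
    ("films", "http://host1:8983/solr", "films_shard1_replica_n1", 1.5),
    ("films", "http://host2:8983/solr", "films_shard2_replica_n2", 2.5),
    ("books", "http://host1:8983/solr", "books_shard1_replica_n1", 4.0),
]


def sum_by(samples, label_index):
    """Sum the per-core rate (field 3) grouped by the label at label_index."""
    totals = defaultdict(float)
    for sample in samples:
        totals[sample[label_index]] += sample[3]
    return dict(totals)


# Distributed QPS per collection (aggregated across all cores).
qps_per_collection = sum_by(SAMPLES, 0)
# Distributed QPS per Solr node (aggregated across each base_url).
qps_per_node = sum_by(SAMPLES, 1)
```

Comparing the per-core values within one collection is what would reveal unbalanced load; the sums alone hide it.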

Also, the {{solr-exporter-config.xml}} uses {{jq}} queries to pull metrics from 
the output from {{/admin/metrics}}. This file is huge and contains a bunch of 
{{jq}} boilerplate. Moreover, I'm introducing another 15-20 metrics in this PR, 
which will only make the file more verbose.

Thus, I'm also introducing support for jq templates so as to reduce 
boilerplate, reduce syntax errors, and improve readability. For instance, the 
query metrics I'm adding to the config look like this:
{code}
          <str>
            $jq:core-query(1minRate, endswith(".distrib.requestTimes"))
          </str>
          <str>
            $jq:core-query(5minRate, endswith(".distrib.requestTimes"))
          </str>
{code}
This avoids duplicating the complicated {{jq}} query for each metric. The 
templates are optional and should only be used if a given jq structure is 
repeated 3 or more times. Otherwise, inlining the jq query is still supported. 
Here's how the templates work:

{code}
  A regex with named groups is used to match template references to template + 
vars using the basic pattern:

      $jq:<TEMPLATE>( <UNIQUE>, <KEYSELECTOR>, <METRIC>, <TYPE> )

  For instance,

      $jq:core(requests_total, endswith(".requestTimes"), count, COUNTER)

  TEMPLATE = core
  UNIQUE = requests_total (unique suffix for this metric, results in a metric 
named "solr_metrics_core_requests_total")
  KEYSELECTOR = endswith(".requestTimes") (filter to select the specific key 
for this metric)
  METRIC = count
  TYPE = COUNTER

  Some templates may have a default type, so you can omit that from your 
template reference, such as:

      $jq:core(requests_total, endswith(".requestTimes"), count)

  This uses defaultType=COUNTER, as many uses of the core template are counts.

  If a template reference omits the metric, then the unique suffix is used, for 
instance:

      $jq:core-query(1minRate, endswith(".distrib.requestTimes"))

  This creates a GAUGE metric (default type) named 
"solr_metrics_core_query_1minRate" using the 1minRate value from the selected 
JSON object.
{code}
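
To make the matching rules above concrete, here is a minimal sketch of how a template reference could be parsed with a regex that uses named groups. It is not the exporter's actual implementation: the pattern, the {{DEFAULT_TYPES}} table, and the function names are assumptions, and the metric naming rule is inferred only from the two example names given above.

```python
import re
from typing import NamedTuple

# Hypothetical pattern; the regex actually used by the exporter may differ.
TEMPLATE_REF = re.compile(
    r"^\$jq:(?P<template>[\w-]+)\(\s*"
    r"(?P<unique>[^,]+?)\s*,\s*"
    r"(?P<keyselector>.+?\))"              # e.g. endswith(".requestTimes")
    r"(?:\s*,\s*(?P<metric>[^,()]+?))?"    # optional METRIC
    r"(?:\s*,\s*(?P<type>[^,()]+?))?"      # optional TYPE
    r"\s*\)$"
)

# Assumed defaults, taken from the two examples in the description.
DEFAULT_TYPES = {"core": "COUNTER", "core-query": "GAUGE"}


class TemplateRef(NamedTuple):
    template: str
    unique: str
    key_selector: str
    metric: str
    type: str


def parse_template_ref(text: str) -> TemplateRef:
    """Split a $jq:<TEMPLATE>(...) reference into its parts, applying defaults."""
    match = TEMPLATE_REF.match(text.strip())
    if match is None:
        raise ValueError(f"not a template reference: {text!r}")
    template = match.group("template")
    unique = match.group("unique")
    # An omitted METRIC falls back to the unique suffix.
    metric = match.group("metric") or unique
    # An omitted TYPE falls back to the template's default type, if it has one.
    metric_type = match.group("type") or DEFAULT_TYPES.get(template)
    if metric_type is None:
        raise ValueError(f"template {template!r} has no default type")
    return TemplateRef(template, unique, match.group("keyselector"), metric, metric_type)


def metric_name(ref: TemplateRef) -> str:
    """Build the exported name, e.g. solr_metrics_core_query_1minRate (inferred rule)."""
    return f"solr_metrics_{ref.template.replace('-', '_')}_{ref.unique}"
```

For example, parsing {{$jq:core-query(1minRate, endswith(".distrib.requestTimes"))}} with this sketch yields metric {{1minRate}} and type {{GAUGE}}, matching the behavior described above.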


> Default Grafana dashboard needs to expose graphs for monitoring query 
> performance
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-15059
>                 URL: https://issues.apache.org/jira/browse/SOLR-15059
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Grafana Dashboard, metrics
>            Reporter: Timothy Potter
>            Assignee: Timothy Potter
>            Priority: Major
>             Fix For: 8.8, master (9.0)


