[ 
https://issues.apache.org/jira/browse/SOLR-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311024#comment-17311024
 ] 

Jan Høydahl commented on SOLR-15300:
------------------------------------

Agree. Last week I attempted to create a simple, generic Prometheus alert
rule that triggers whenever a collection has a shard whose intended
replicationFactor is not satisfied. Something like:
 * Green - all OK: all replicas in all shards have state==active (and are
represented in live_nodes).
 * Yellow - still operational, but replicationFactor is not satisfied at the
moment. (Would trigger a non-critical alert: "Shard N for collection C has a
lower replicationFactor (A) than configured (B).")
 * Red - no replicas for a shard are active. They may be in any other state.
(Would trigger a critical alert: "Collection C is down. Shard N has no live
replicas. Recovery is in progress.")

Currently I cannot find a single metric that expresses this. I have tried
writing various jq filters against the CLUSTERSTATUS data, but it is quite
hard to compare the configured replicationFactor with the actual number of
active replicas in a generic way, for all shards of a collection, and fold
the result into something alertable. So very much +1 to improving this
situation.
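
For illustration, the Green/Yellow/Red classification above could be sketched
roughly as follows in Python. This is only a sketch under assumptions: it
expects the JSON shape returned by the Collections API CLUSTERSTATUS action
(cluster.live_nodes, cluster.collections.<name>.shards.<shard>.replicas with
"state" and "node_name" per replica), it falls back to a replicationFactor of
1 when the collection does not report one, and the function names
shard_health and cluster_health are hypothetical, not part of Solr.

```python
from typing import Any, Dict, Set


def shard_health(shard: Dict[str, Any], live_nodes: Set[str],
                 replication_factor: int) -> str:
    """Classify a single shard as GREEN, YELLOW or RED."""
    replicas = list(shard.get("replicas", {}).values())
    # A replica only counts if it is active AND hosted on a live node.
    active = sum(
        1 for r in replicas
        if r.get("state") == "active" and r.get("node_name") in live_nodes
    )
    if active == 0:
        return "RED"      # no usable replica for this shard
    if active < replication_factor or active < len(replicas):
        return "YELLOW"   # operational, but degraded
    return "GREEN"        # every replica is active and live


def cluster_health(status: Dict[str, Any]) -> Dict[str, Dict[str, str]]:
    """Map collection name -> shard name -> health, from CLUSTERSTATUS JSON."""
    cluster = status["cluster"]
    live_nodes = set(cluster.get("live_nodes", []))
    result: Dict[str, Dict[str, str]] = {}
    for coll_name, coll in cluster.get("collections", {}).items():
        # Assumption: replicationFactor may arrive as a string or be absent.
        rf = int(coll.get("replicationFactor", 1))
        result[coll_name] = {
            shard_name: shard_health(shard, live_nodes, rf)
            for shard_name, shard in coll.get("shards", {}).items()
        }
    return result
```

The per-shard values could then be exported as a numeric gauge (for example
0, 1, 2) so that a single Prometheus rule can alert on any shard that is not
Green.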

Perhaps this collides a bit with the PRS effort, which aims to avoid touching
state.json for replica state changes, so I am not sure how the two would fit
together.

> Shard "state" flag is confusing and of limited value to outside consumers
> -------------------------------------------------------------------------
>
>                 Key: SOLR-15300
>                 URL: https://issues.apache.org/jira/browse/SOLR-15300
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>
> Solr API (and consequently the metric reporters, which are often used for 
> Solr monitoring) report the shard as being in ACTIVE state even when in 
> reality its functionality is severely compromised (e.g. no replicas, all 
> replicas down, or no leader).
> This reported state is technically correct because it is used only for 
> tracking of the SPLITSHARD operations, as defined in {{Slice.State}}. 
> However, this may be misleading and more often unhelpful than not - for 
> constant monitoring a flag that actually reports impaired functionality of a 
> shard would be more useful than a flag that reports a relatively uncommon 
> SPLITSHARD operation.
> We could either redefine the meaning of the existing flag (and change its 
> state according to some of the criteria I listed above), or add another flag 
> to represent the "health" status of a shard. The value of this flag would 
> then provide an easy way to monitor and to alert external systems of 
> dangerous function impairment, without monitoring the state of all replicas 
> of a collection.


