Thanks.

What about 'NN ops > 32 sec' (blocked ops) type alerts? Does anyone monitor
for those, and if so, what criteria do you use?

Thanks again!

On Fri, Jan 13, 2017 at 3:28 PM, David Turner <david.tur...@storagecraft.com> wrote:

> We don't use many critical alerts (the kind that have our NOC wake up an
> engineer), but the main one we do have is a check that tells us when 2 or
> more hosts have osds that are down.  We have clusters with 60 servers in
> them, so having an osd die and backfill off of it isn't something to wake
> up for in the middle of the night, but having osds down on 2 servers is
> 1 osd away from data loss.  A quick reference for how to do this check in
> bash is below.
>
> # Count the hosts that have at least one down osd: keep only host and
> # down lines, then count the host lines that directly precede a down line.
> hosts_with_down_osds=$(ceph osd tree | grep 'host\|down' | grep -B1 down | grep -c host)
> if [ "$hosts_with_down_osds" -ge 2 ]
> then
>     echo critical
> elif [ "$hosts_with_down_osds" -eq 1 ]
> then
>     echo warning
> elif [ "$hosts_with_down_osds" -eq 0 ]
> then
>     echo ok
> else
>     echo unknown
> fi
>
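The check above could also be written so that it does not depend on grep context lines. What follows is only a sketch, not the poster's script: the function names count_hosts_with_down_osds and status_for_count are made up for illustration, the awk logic assumes the plain-text `ceph osd tree` layout in which a host row contains the word "host" followed by the host name and an osd row contains "osd.N" followed by its up/down status, and the Nagios-style return codes (0 ok, 1 warning, 2 critical, 3 unknown) are an assumption about the monitoring system in use.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and return codes are illustrative assumptions.

# Read plain-text `ceph osd tree` output on stdin and print the number of
# distinct hosts that have at least one osd marked down.
count_hosts_with_down_osds() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                # A host row: remember which host the following osds belong to.
                if ($i == "host") { host = $(i + 1); break }
                # An osd row whose status column reads "down": flag its host.
                if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") { down[host] = 1; break }
            }
        }
        END { n = 0; for (h in down) n++; print n }
    '
}

# Map the host count to a status word and a Nagios-style return code.
status_for_count() {
    case "$1" in
        ''|*[!0-9]*) echo unknown;  return 3 ;;  # empty or not a number
        0)           echo ok;       return 0 ;;
        1)           echo warning;  return 1 ;;
        *)           echo critical; return 2 ;;  # 2 or more hosts affected
    esac
}

# Usage against a live cluster (needs the ceph CLI and a readable keyring):
#   status_for_count "$(ceph osd tree | count_hosts_with_down_osds)"
```

Splitting the counting from the status mapping keeps the thresholds in one place, so moving the critical cutoff only means editing the case statement.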
> ------------------------------
>
> David Turner | Cloud Operations Engineer | StorageCraft Technology
> Corporation <https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
>
> ------------------------------
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Chris
> Jones [cjo...@cloudm2.com]
> *Sent:* Friday, January 13, 2017 1:15 PM
> *To:* ceph-us...@ceph.com
> *Subject:* [ceph-users] Ceph Monitoring
>
> General question/survey:
>
> Those of you with larger clusters, how are you doing alerting/monitoring?
> Meaning, do you trigger off of 'HEALTH_WARN', etc.? I'm not really talking
> about collectd-related metrics but more about initial alerts of an issue
> or potential issue. What thresholds do you use, basically? Just trying to
> get a pulse of what others are doing.
>
> Thanks in advance.
>
> --
> Best Regards,
> Chris Jones
> Bloomberg
>
>
>
>


-- 
Best Regards,
Chris Jones

cjo...@cloudm2.com
(p) 770.655.0770
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
