Thanks. What about 'NN ops > 32 sec' (blocked ops) type alerts? Does anyone monitor for those, and if so, what criteria do you use?
Thanks again!

On Fri, Jan 13, 2017 at 3:28 PM, David Turner <david.tur...@storagecraft.com> wrote:

> We don't use many critical alerts (the kind that have our NOC wake up an
> engineer), but the main one we do have is a check that tells us if there
> are 2 or more hosts with osds that are down. We have clusters with 60
> servers in them, so having an osd die and backfill off of it isn't
> something to wake up for in the middle of the night, but having osds down
> on 2 servers is 1 osd away from data loss. A quick reference for how to do
> this check in bash is below.
>
> # Count the hosts in `ceph osd tree` that have at least one down osd.
> hosts_with_down_osds=$(ceph osd tree | grep 'host\|down' | grep -B1 down | grep host | wc -l)
> if [ "$hosts_with_down_osds" -ge 2 ]
> then
>     echo critical
> elif [ "$hosts_with_down_osds" -eq 1 ]
> then
>     echo warning
> elif [ "$hosts_with_down_osds" -eq 0 ]
> then
>     echo ok
> else
>     echo unknown
> fi
>
> David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation <https://storagecraft.com>
>
> ------------------------------
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Chris Jones [cjo...@cloudm2.com]
> *Sent:* Friday, January 13, 2017 1:15 PM
> *To:* ceph-us...@ceph.com
> *Subject:* [ceph-users] Ceph Monitoring
>
> General question/survey:
>
> For those of you with larger clusters, how are you doing alerting/monitoring?
> Meaning, do you trigger off of 'HEALTH_WARN', etc.? I'm not really talking
> about collectd-related monitoring, but more about initial alerts of an issue
> or potential issue. What thresholds do you use, basically? Just trying to get
> a pulse of what others are doing.
>
> Thanks in advance.
>
> --
> Best Regards,
> Chris Jones
> Bloomberg

--
Best Regards,
Chris Jones
cjo...@cloudm2.com
(p) 770.655.0770
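For anyone adapting the down-host check quoted above, here is a minimal sketch of the same idea with the parsing split from the status mapping, so it can be tried against saved `ceph osd tree` output without a live cluster. The function names `count_hosts_with_down_osds` and `status_for_count` are made up for this example, and the parsing assumes each `host` row is followed by its `osd.N` rows with the up/down state printed right after the osd name; the column layout can differ between Ceph releases, so check it against your own output first.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Ceph): same thresholds as the quoted check.

# Reads `ceph osd tree`-style text on stdin and prints the number of
# distinct hosts that have at least one osd marked "down".
count_hosts_with_down_osds() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                # A host row: remember which host the following osds belong to.
                if ($i == "host") { host = $(i + 1); break }
                # An osd row: the field after the osd name is assumed to be its state.
                if ($i ~ /^osd\.[0-9]+$/) {
                    if ($(i + 1) == "down" && host != "") down[host] = 1
                    break
                }
            }
        }
        END { n = 0; for (h in down) n++; print n }
    '
}

# Maps a host count to the same ok/warning/critical/unknown words as above.
status_for_count() {
    case "$1" in
        ''|*[!0-9]*) echo unknown ;;   # empty or non-numeric input
        0) echo ok ;;
        1) echo warning ;;
        *) echo critical ;;            # 2 or more hosts with down osds
    esac
}
```

On a node with admin access this could be wired up as `status_for_count "$(ceph osd tree | count_hosts_with_down_osds)"`. Counting distinct hosts in awk avoids relying on `grep -B1` ordering, but it is only a sketch under the layout assumption stated above.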
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com