Re: [ceph-users] Use telegraf/influx to detect problems is very difficult
I just briefly peeked into the source of the module, and I suppose this is because the main design idea is simply to forward existing metrics from Ceph core without calculating anything. It seems to me that most users probably use Prometheus, which doesn't have this kind of issue.

Monitor down is also easy as pie, because it's just "num_mon - mon_quorum". There is also a metric mon_outside_quorum, which is always zero for me, and I don't really know how it works.

OSD near full will probably be trickier: you have to use "osd.stat_bytes_used / osd.stat_bytes" for each OSD and compare it with your own configured value (it is not a metric, so it is not exported). Or you can just watch the general cluster health metric (which you should do anyway) and raise a general alarm in this case.

M.

On 11. 12. 19 21:18, Mario Giammarco wrote:
> Miroslav replied better for us why "is not so simple" to use math.
> And osd down was the easiest. How can I calculate:
> - monitor down
> - osd near full
> ?
>
> I do not understand why ceph plugin cannot send to influx all the
> metrics it has, especially the most useful for creating alarms.
>
> On Wed, 11 Dec 2019 at 04:58, Konstantin Shalygin <k0...@k0ste.ru> wrote:
>
>>> But it is very difficult/complicated to make simple queries because, for
>>> example I have osd up and osd total but not osd down metric.
>>
>> To determine how much osds down you don't need special metric,
>> because you already have osd_up and osd_in metrics. Just use math.
>>
>> k

--
Miroslav Kalina
Systems development specialist
miroslav.kal...@livesport.eu
+420 773 071 848
Livesport s.r.o.
Aspira Business Centre
Bucharova 2928/14a, 158 00 Praha 5
www.livesport.eu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
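The two calculations described in this message (monitors down, and per-OSD near-full detection) could be sketched roughly as follows. This is only an illustration: the sample values are made up, the helper names are hypothetical, and the 0.85 threshold is an assumption that should be replaced with whatever nearfull ratio your cluster is actually configured with, since that value is not exported as a metric.

```python
# Assumption: replace with the nearfull ratio configured on your cluster.
NEARFULL_RATIO = 0.85


def monitors_down(num_mon, mon_quorum):
    """Monitors missing from quorum: total monitors minus monitors in quorum."""
    return num_mon - mon_quorum


def nearfull_osds(osd_stats, ratio=NEARFULL_RATIO):
    """Return the ids of OSDs whose used/total ratio reaches the threshold."""
    result = []
    for osd_id, stats in sorted(osd_stats.items()):
        total = stats["stat_bytes"]
        # Skip OSDs that report zero capacity to avoid dividing by zero.
        if total > 0 and stats["stat_bytes_used"] / total >= ratio:
            result.append(osd_id)
    return result


# Hypothetical sample values, named after the metrics mentioned above.
cluster = {"num_mon": 3, "mon_quorum": 2}
osds = {
    "osd.0": {"stat_bytes_used": 900, "stat_bytes": 1000},
    "osd.1": {"stat_bytes_used": 500, "stat_bytes": 1000},
}

mons_down = monitors_down(cluster["num_mon"], cluster["mon_quorum"])
full = nearfull_osds(osds)
print(mons_down, full)
```

In practice the inputs would come from whatever query layer sits on top of InfluxDB; the point is only that the comparison against the configured ratio has to happen on the client side.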
Re: [ceph-users] Use telegraf/influx to detect problems is very difficult
As I mentioned here yesterday, there is an issue with the current metrics scheme. With the current scheme you cannot do simple math like

> SELECT num_osd - num_osd_up FROM "ceph_cluster_stats"

Instead you would need a query like

> SELECT (SELECT last("value") FROM "ceph_cluster_stats" WHERE "type_instance" = 'num_osd') - (SELECT last("value") FROM "ceph_cluster_stats" WHERE "type_instance" = 'num_osd_up')

which is not supported by InfluxDB. I know this type of query works perfectly in the Prometheus or SQL world, but AFAIK you unfortunately cannot easily combine multiple series in InfluxDB.

Regarding Mario's issue with alerting: maybe you can try Kapacitor for alerting purposes. I have no direct experience with it, but it should be easy to control via Chronograf and could solve your issue.

M.

On 12/11/19 4:58 AM, Konstantin Shalygin wrote:
>> But it is very difficult/complicated to make simple queries because, for
>> example I have osd up and osd total but not osd down metric.
>
> To determine how much osds down you don't need special metric, because
> you already have osd_up and osd_in metrics. Just use math.
>
> k

--
Miroslav Kalina
Systems development specialist
miroslav.kal...@livesport.eu
+420 773 071 848
Livesport s.r.o.
Aspira Business Centre
Bucharova 2928/14a, 158 00 Praha 5
www.livesport.eu
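Since the subtraction cannot be expressed in a single InfluxDB query with this scheme, one possible workaround is to fetch the raw rows and combine the series on the client. The sketch below assumes the rows have already been retrieved as (time, type_instance, value) tuples; the sample data and the helper names are hypothetical, and no InfluxDB client is involved.

```python
from collections import defaultdict


def pivot_by_time(rows):
    """Group (time, type_instance, value) rows into {time: {type_instance: value}}."""
    table = defaultdict(dict)
    for time, type_instance, value in rows:
        table[time][type_instance] = value
    return dict(table)


def osds_down(table):
    """Compute num_osd - num_osd_up for every timestamp that has both values."""
    return {
        time: fields["num_osd"] - fields["num_osd_up"]
        for time, fields in sorted(table.items())
        if "num_osd" in fields and "num_osd_up" in fields
    }


# Hypothetical rows, shaped like the "ceph_cluster_stats" series discussed above.
rows = [
    (1000, "num_osd", 12),
    (1000, "num_osd_up", 11),
    (1060, "num_osd", 12),
    (1060, "num_osd_up", 12),
    (1120, "num_osd", 12),  # no matching num_osd_up, so this timestamp is skipped
]

down = osds_down(pivot_by_time(rows))
print(down)
```

This moves the math out of the database, so it helps a script or an alerting job but does not by itself fix a Grafana panel that queries InfluxDB directly.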
[ceph-users] Ceph-mgr :: Grafana + Telegraf / InfluxDB metrics format
Hello guys,

is anyone using the Telegraf / InfluxDB metrics exporter with Grafana dashboards? I am asking because I was unable to find any existing Grafana dashboards based on InfluxDB.

I am having a hard time creating the graphs I want to see. Metrics are exported in such a way that every single one is stored in a separate series in Influx, like:

> ceph_pool_stats,cluster=ceph1,metric=read value=1234 15506589110
> ceph_pool_stats,cluster=ceph1,metric=write value=1234 15506589110
> ceph_pool_stats,cluster=ceph1,metric=total value=1234 15506589110

instead of a single series like:

> ceph_pool_stats,cluster=ceph1 read=1234,write=1234,total=1234 15506589110

This means that when I want to create a graph of something like % usage ratio (= bytes_used / bytes_total) or number of faulty OSDs (= num_osd_up - num_osd_in), I am unable to do it with a single query like

> SELECT mean("num_osd_up") - mean("num_osd_in") FROM "ceph_cluster_stats" WHERE "cluster" =~ /^ceph1$/ AND time >= now() - 6h GROUP BY time(5m) fill(null)

Instead it requires two queries followed by a math operation, which I was unable to get working in either Grafana or InfluxDB (I believe it's not supported; Influx removed JOIN queries some time ago).

I didn't see any way to modify the format of the metrics exported to Telegraf. I feel like I am missing something pretty obvious here.

I am currently unable to switch to the Prometheus exporter (which doesn't have this kind of issue) because of my current infrastructure setup.

Currently I am using the following versions:

* Ceph 14.2.4
* InfluxDB 1.6.4
* Grafana 6.4.2

So ... does anyone have it working? Could you please share your dashboards?

Best regards

--
Miroslav Kalina
Systems development specialist
Livesport s.r.o.
Aspira Business Centre
Bucharova 2928/14a, 158 00 Praha 5
www.livesport.eu
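To make the difference between the two formats concrete, here is a rough sketch that folds the per-metric points shown above into a single multi-field point. It is not something the exporter or Telegraf is known to offer; it is a standalone illustration with a hypothetical helper name, and the parser is deliberately simplified: it assumes no escaped spaces or commas and exactly one "value" field per line.

```python
def merge_points(lines):
    """Fold one-point-per-metric lines into one multi-field line per series."""
    merged = {}
    for line in lines:
        # Simplified line protocol: "<measurement,tags> <field> <timestamp>".
        key, field, timestamp = line.split(" ")
        measurement, *tag_pairs = key.split(",")
        tags = dict(pair.split("=", 1) for pair in tag_pairs)
        metric = tags.pop("metric")        # the "metric" tag becomes the field name
        value = field.split("=", 1)[1]     # keep the value text unchanged
        series = (measurement, tuple(sorted(tags.items())), timestamp)
        merged.setdefault(series, {})[metric] = value

    out = []
    for (measurement, tags, timestamp), fields in merged.items():
        tag_str = "".join(f",{k}={v}" for k, v in tags)
        field_str = ",".join(f"{k}={v}" for k, v in fields.items())
        out.append(f"{measurement}{tag_str} {field_str} {timestamp}")
    return out


# The three per-metric points quoted in the message above.
points = [
    "ceph_pool_stats,cluster=ceph1,metric=read value=1234 15506589110",
    "ceph_pool_stats,cluster=ceph1,metric=write value=1234 15506589110",
    "ceph_pool_stats,cluster=ceph1,metric=total value=1234 15506589110",
]

merged_lines = merge_points(points)
print(merged_lines[0])
```

With the fields in one series, a query such as `SELECT "read" - "write" FROM "ceph_pool_stats"` becomes ordinary field arithmetic, which is the kind of math the per-metric layout prevents.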